[PDF] Identifying down and up-regulated chromosome regions using RNA-Seq data

Abstract

The number of studies dealing with RNA-Seq data analysis has experienced a fast increase in the past years making this type of gene expression a strong competitor to the DNA microarrays. This paper proposes a Bayesian model to detect down and up-regulated chromosome regions using RNA-Seq data. The methodology is based on a recent work developed to detect up-regulated regions in the context of microarray data. A hidden Markov model is developed by considering a mixture of Gaussian distributions with ordered means in a way that first and last mixture components are supposed to accommodate the under and overexpressed genes, respectively. The model is flexible enough to efficiently deal with the highly irregular spaced configuration of the data by assuming a hierarchical Markov dependence structure. The analysis of four cancer data sets (breast, lung, ovarian and uterus) is presented. Results indicate that the proposed model is selective in determining the regulation status, robust with respect to prior specifications and provides tools for a global or local search of under and overexpressed chromosome regions.

Full PDF

IIdentifying down and up-regulated chromosome regions usingRNA-Seq data.

Vin´ıcius Diniz Mayrink and Fl´avio Bambirra Gon¸calves

Departamento de Estat´ıstica, Universidade Federal de Minas Gerais, Brazil.

Abstract

The number of studies dealing with RNA-Seq data analysis has experienced a fast increase in thepast years making this type of gene expression a strong competitor to the DNA microarrays. This paperproposes a Bayesian model to detect down and up-regulated chromosome regions using RNA-Seq data.The methodology is based on a recent work developed to detect up-regulated regions in the context ofmicroarray data. A hidden Markov model is developed by considering a mixture of Gaussian distributionswith ordered means in a way that ﬁrst and last mixture components are supposed to accommodate theunder and overexpressed genes, respectively. The model is ﬂexible enough to eﬃciently deal with thehighly irregular spaced conﬁguration of the data by assuming a hierarchical Markov dependence structure.The analysis of four cancer data sets (breast, lung, ovarian and uterus) is presented. Results indicatethat the proposed model is selective in determining the regulation status, robust with respect to priorspeciﬁcations and provides tools for a global or local search of under and overexpressed chromosomeregions.keywords: gene expression, mixture model, cancer, Gibbs sampling.

Methods based on next-generation sequencing (NGS) to study the genome have ﬂourished in the past yearshelping to understand changes in the transcriptome and leading to the discovery of new mutations andfusion genes. The identiﬁcation of genes driving the cancer progression is, for example, the focus of extensiveresearch using this technology (Maher et al., 2009; Berger et al., 2010; Han et al., 2011). In this paper, we areparticularly interested in the analysis of RNA-Seq (RNA sequencing) data quantifying the amount of RNAin a biological sample at a given moment in time (Wang et al., 2009; Oshlack et al., 2010; Chu and Corey,2012; Conesa et al., 2016). The RNA-Seq data can be used in diﬀerent types of investigations including,for example, modiﬁcations in gene expression over time (Nueda et al., 2014; Van-De-Wiel et al., 2013) and(more often) a diﬀerential expression analysis comparing distinct conditions, groups or tissue types (Andersand Huber, 2010; Bullard et al., 2010; Robinson et al., 2010; McCarthy et al., 2012; Soneson and Delorenzi,2013; Zhang et al., 2015; Papastamoulis and Rattray, 2017).Prior to the advent of the NGS, the DNA microarrays (Amaratunga et al., 2014) were the major sourceof gene expression data explored in the literature. The ﬁrst experiments to obtain RNA-Seq data were highlyexpensive compared to the microarray technology; this obstacle restricted the experimental design in manystudies imposing analyses with few replications. The focus on the microarray technology is evident in thediscussion presented in Horvath and Baur (2000) (back in 2000) about the perspectives of future directionsin statistical genetics. As the NGS became aﬀordable along the years, the number of studies based on RNA-Seq data increased rapidly leading to a scenario where it is now considered a powerful competitor to themicroarrays. Studies confronting “RNA-Seq versus microarrays” can be easily found in the literature; see,for example, Marioni et al. (2008) and Nookaew et al. (2012). The higher interest in RNA-Seq can also beexplained by its advantages over the microarrays: lower background noise, the ability to detect low frequencyand novel transcripts from genome regions not previously annotated, the requirement for less RNA samplesto generate the data.The work developed here is heavily based on the recent study reported in Mayrink and Gon¸calves (2017),where the authors propose a Bayesian Markov mixture model to detect overexpressed regions on the chromo-somes using microarray data. The model basically consists of a mixture of some gammas and one Gaussian

Address for correspondence:

Vin´ıcius Mayrink, Departamento de Estat´ıstica, ICEx, UFMG. Av. Antˆonio Carlos,6627, Belo Horizonte, MG, Brazil, 31270-901. E-mail: [email protected] a r X i v : . [ s t a t . A P ] A ug istributions in which the latter is the mixture component with the highest mean and is suppose to accom-modate the overexpressed regions. The authors use the known expected similarity among near locations asan important source of information by imposing a Markov structure to the expression measurements alongthe chromosome. Moreover, given the highly irregular distances between the observed locations, modellingand computational diﬃculties are overcome by a hierarchically structured Markov dependence as a func-tion of the distances. Finally, the analysis is entirely focused on Aﬀymetrix oligonucleotide arrays with thepreprocessed (RMA) probe intensities being mapped to chromosome locations via the alignment tool BLAT.Another interesting study dealing with the detection and classiﬁcation of chromosome regions accordingto the status of the expressions from RNA-Seq can be found in Frazee et al. (2014). The main goal is thecomparison between groups (e.g.: male/female or control/treatment) to evaluate the existence of expressiondiﬀerences. The authors ﬁt a linear regression model (under the frequentist approach) to each alignedlocation in the genome, with the response being the number of sequences overlapping each location. Thesigniﬁcance of coeﬃcients (location speciﬁc and related to group indicators) is used to segment the genomeinto contiguous regions labeled as over, under or non-diﬀerentially expressed. The statistics to test thesigniﬁcance of such coeﬃcients are determined conditionally on an underlying ﬁrst-order Markov processwith three states.The approaches from Mayrink and Gon¸calves (2017) and Frazee et al. (2014) diﬀer in some criticalaspects. In terms of modelling, Frazee et al. (2014) does not consider the distance between locations as asource of information. In addition, the target application in Frazee et al. (2014) is to compare groups andidentify genomic regions exhibiting diﬀerences among those groups. The methodology proposed in Mayrinkand Gon¸calves (2017), on the other hand, aims at detecting high expression regions within a single group(e.g.: a tumor type) without having a reference group to evaluate distinctions. The overexpressed statusin Mayrink and Gon¸calves (2017) is meant to call the attention for speciﬁc parts of the genome where thegenes may be up-regulated and playing leading roles for the progression of the disease under investigation.The main contribution of the present paper is to extend (considering reasonable adaptations) the method-ology proposed in Mayrink and Gon¸calves (2017) for applications involving RNA-Seq data. As it will bediscussed later, unlike the microarray data from Mayrink and Gon¸calves (2017), the preprocessed observa-tions considered here present a symmetric empirical behaviour. For that reason, the original gamma mixturecomponents are replaced by Gaussian ones. Moreover, the present work considers not only the detectionof overexpressed chromosome regions but also the detection of underexpressed ones. Finally, the proposedmethodology is applied to four datasets, each of them concerning a diﬀerent type of cancer (breast, lung,ovarian and uterus).The outline of this paper is as follows. Section 2 describes the RNA-Seq data to be analysed, includingall the preprocessing steps to organise, rescale and remove part of the noise in the raw data. Section 3presents the proposed mixture model while Section 4 describes the details of the Bayesian inference procedure,including the MCMC algorithm and the cluster detection strategy. Section 5 shows the analysis of the fourcancer data sets to illustrate the methodology. Finally, Section 6 presents the main conclusions and ﬁnalremarks. The RNA-Seq data, explored in this study, was obtained from the Genomic Data Commons (GDC) websitemaintained by the U.S. National Cancer Institute (NCI). According to their own description, the GDC dataportal is a robust data-driven platform providing an uniﬁed repository that enables data sharing acrosscancer genomic studies; see https://portal.gdc.cancer.gov for more information. The data collectedfrom the GDC portal represent four types of cancer: breast invasive carcinoma, lung adenocarcinoma, ovarianserous cystadenocarcinoma and uterine corpus endometrial carcinoma. Hereafter, these types of tumors willbe denoted by: Breast (45 samples), Lung (56 samples), Ovarian (72 samples) and Uterus (48 samples),respectively. All samples are related to female human subjects with the age at diagnosis ranging between40–90 years old; the death occurred within the period of 2 years since the diagnosis. Given that the datarefers only to female subjects, the chromosome Y is not included in our analysis.Each of the samples is composed by thousands of measurements related to nucleotide sequences (genes)in the genome. The main steps to produce the data are: ( i ) the RNA’s are isolated from a tissue andconverted to cDNA fragments, ( ii ) a high-throughput sequencer generates millions of reads (short nucleotide2equences) from the cDNA fragments, ( iii ) an alignment tool is used to map the reads to the human genomeand ( iv ) the number of reads mapped to a gene is regarded as the raw expression value.The read mapping uncertainty can be seen as an experimental challenge in the process to build this type ofdata. The reads are shorter than the transcripts from which they are originated, thus a single read may mapto multiple genes, which is certainly a factor complicating the analysis. There are many approaches proposedin the literature to deal with this problem; some examples are found in Marioni et al. (2008), Faulkner et al.(2008), Mortazavi et al. (2008), Li et al. (2010), Garber et al. (2011) and Fonseca et al. (2012). The mappinguncertainty issue is not the focus of the present paper; the data sets extracted from the GDC portal havealready been aligned to the human reference genome. For each sample, the data set contains: the geneidentiﬁcation (according the ensembl annotation, see ) and the number of readsmapped to that gene. Using the software R (R Core Team, 2017) and the package Genominator (Bullardet al., 2010) from the Bioconductor project ( http://bioconductor.org , see Gentleman et al. (2004)), onecan easily retrieve the chromosome name and the start/end position of the gene within the chromosome bycomparing the gene annotations (reported in the data sets) with the current version of the human genome(labeled as “ hsapiens gene ensembl ” in the R package). In particular, the start/end position of a geneprovides two important quantities for our study; their average is assumed as the location of the gene andtheir diﬀerence indicates the length of the gene. Very few inconsistencies were found in the data - basicallylocations with multiple expressions; these cases were deleted from the analysis without signiﬁcant loss. Atthis point of the data description, each sample (for any cancer) comprises 57 ,

528 genesThe raw expression measurements in a RNA-Seq data set must undergo preprocessing steps for nor-malisation before any statistical analysis. Normalisation is essential in gene expression studies using eithermicroarray (Mayrink and Lucas, 2015; Irizarry et al., 2003) or RNA-Seq data (Robinson et al., 2010; Bullardet al., 2010; Hansen et al., 2012; Dillies et al., 2012). In the RNA-Seq case, the procedure is applied toremove systematic variations related to between-sample diﬀerences (i.e.: library size or sequencing depth,see Mortazavi et al. (2008)) and within-sample eﬀects (gene lengths, see Oshlack and Wakeﬁeld (2010)).More speciﬁcally, diﬀerent samples may have diﬀerent library sizes; i.e., their total number of mapped readsmay diﬀer. The bigger the library size, the larger are the counts observed for the genes; this implies in higherbiological variability across the samples. In addition, it is well known that the genes have diﬀerent lengthsand, as a consequence, more reads tend to be mapped to the longer ones, misleading their expression withrespect to the others.Using the R package edgeR (Robinson et al., 2010) as a support, the following normalisation steps havebeen applied to the raw RNA-Seq data in this paper.1. Filtering: If a gene has 0 counts for the majority of the samples, then it is not clear whether thatgene is indeed unimportant or if it could not be properly detected by the sequencer. In line with thiscomment, genes with non-zero read counts in less than 10 samples (for each cancer) were removed fromthe analysis.2. Resetting the library size: given the removal in the previous step, the total read count per sample isupdated.3. Computing the normalising factor: a value is obtained for each sample to scale the updated librarysizes. The method Trimmed Mean of M-values (TMM) in Robinson and Oshlack (2010) is consideredhere.4. Normalising between- and within-samples: the method RPKM (reads per kilobase per million, imple-mented in edgeR ) is applied taking as input arguments the normalising factor (previous step) and thegene lengths. The preprocessed expressions are reported in the log scale.It is important to emphasise that the group of genes removed in the ﬁltering step of this normalisationis not the same for all cancers. As a result, the list and number of genes diﬀer between the normaliseddata sets; we have: 37 ,

752 genes (Breast), 39 ,

397 (Lung), 44 ,

293 (Ovarian) and 38 ,

739 (Uterus). The totalnumber of genes being evaluated in at least one cancer is 45 ,

543 (77 .

3% of them are found in all four datasets). Table 1 shows percentages indicating the level of intersections between the list of genes of two datasets. Note that the lowest percentage is 83 . Breast Lung Ovarian UterusBreast 100 90.6 83.6 87.8Lung 100 86.1 88.1Ovarian 100 85.3Uterus 100(a) (b)

Figure 1: Graphs related to the data set “Breast”. Panel (a): Histogram displaying the distribution of themedians of the preprocessed log expressions (chromosomes 1 −

22 and X ). Panel (b): Points showing thepositions of the medians along chromosome 1.Figure 2: Image displaying all p-values of the Moran’s I permutation test applied to all data sets (summarisedby the median); cancers in each row and chromosomes (1 −

22 and X ) in each column.In any cancer, the histograms for each preprocessed sample (i.e., without the median summarisation) in-dicate exactly the same symmetric shape shown in Figure 1 with support also condensed between ( − , R package spdep Bivandand Piras (2015) is considered here to apply the test. As it can be seen, several p-values are small (inwhite) suggesting that the spatial association is signiﬁcant in many cases. In addition, some large p-values(in grey) are also observed suggesting absence of spatial association. This combination of results deﬁnitelymotivates a modelling strategy allowing for the presence or absence of spatial dependence and accountingfor the distances between neighbours to explain this dependence.Note that the magnitude of the original distances, indicated in Figure 1 (b), is too large and this mightdetermine computational diﬃculties to ﬁt a model. This concern is also reported in Mayrink and Gon¸calves(2017) and the paper handles this problem by rescaling the distances to the interval (0 ,

1) using two steps:calculate the log-distances and divide each of them by the maximum log-distance. We will apply the sametransformation in our study. As discussed in Mayrink and Gon¸calves (2017), taking the log preserves highdiﬀerences among the smallest values and practically even out the highest values; this is coherent withthe model proposed in the next section, which assumes the distances as covariates to explain the Markovdependence.Figure 3: Histograms of all rescaled distances between locations in all chromosomes.Figure 3 presents the histograms for all rescaled distances in the chromosomes. Note that most valuesare above 0 . , For a cancer data set, let n be the number of genes aligned to the genome (each gene has a unique location).Deﬁne the data as X = { X i , i = 1 , . . . , n } where X i is the median of the preprocessed expressions in location i that were not originally zero. Again, the summarisation through the median is justiﬁed by symmetry,unimodality and the number of replications available for this calculation.5he main purpose of this study is to detect underexpressed (down-regulated) and overexpressed (up-regulated) regions in the human genome for each selected cancer. Recall that Mayrink and Gon¸calves (2017)models only the overexpressed regions; the down-regulated cases are left aside in the analysis due to thestrong right-skewed behaviour of the data. The underexpressed status of a gene is also a valuable piece ofinformation that could help to explain key details about a disease. As shown in Figure 1 (a), the preprocessedRNA-Seq data is fairly symmetric making both tails of the histogram equally important to be modelled - theleft tail being connected to the down-regulation. It is not diﬃcult to ﬁnd in the literature studies evaluatingboth down and up-regulated status of a gene based on high-throughput sequencing data; a few examples areHan et al. (2011), Trapnell et al. (2013) and Law et al. (2014).The naive strategy of ﬁxing two arbitrary thresholds and identifying the gene expressions (in the tails)below/above these values is discarded in our study for the same reasons discussed in Mayrink and Gon¸calves(2017): ( i ) there is no explicit gap in the distribution of the expressions in each chromosome, ( ii ) there isno clear scale of the measured expression due to the normalisation steps and ( iii ) this attempt would ignorethe dependence known to exist among near genes (e.g.: see DNA copy number alteration analysis in Pollacket al. (2002) and Lucas et al. (2010)).Our model considers the dependence among near genes to ﬁnd the chromosome regions of interest. Anunderexpressed (overexpressed) region is seen as a sequence of genes showing low (high) expressions in thechromosome. Such regions will be associated with the components having the smallest and largest meanin a mixture distribution; no ﬁxed thresholds are speciﬁed here. The dependence structure is establishedby assuming that near genes are likely to be in the same mixture component. This structure also disfavorsthe detection of spikes, i.e., an isolated diﬀerentially expressed gene whose neighbours are not down orup-regulated as well. In order to achieve this goal, informative prior distributions are required for someparameters in the model.Unlike Mayrink and Gon¸calves (2017) that considers a mixture of gamma distributions to eﬃcientlyaccount for the signiﬁcant skewness of the microarray data explored in that work, this paper considers amixture of Gaussian distributions due to the fairly symmetric behaviour of the RNA-Seq to be analysed.This change also simpliﬁes the MCMC algorithm developed to perform inference.Following the steps of Mayrink and Gon¸calves (2017), we deﬁne a hierarchical Markov structure to modelthe dependence among the genes. On the ﬁrst level, the Markov dependence is assumed to be the same forany consecutive pair of genes aligned to the genome, however, this dependence is present with a probabilitywhich is modelled as a function of the distances on the second level. The Markov dependence is essentialto handle the fact that it is unlikely that a single location with an expression much lower/higher thanits neighbours is indeed representing a gene from a diﬀerentially expressed region. The up/down spikedobservations suggest a local atypical expression and they are expected to be captured by the left or righttail of the adjacent Gaussian component in the mixture. The cases where an isolated spike may belong to apotential diﬀerentially expressed region are those where their neighbours are not too close.Let K be the number of normal components in the mixture distribution. Note that K ≥ q = ( q , . . . , q K ) (cid:48) and the transition matrix Q = { q k k } of a K states and discrete time Markov chain; q k is the k -th row of Q and consider k , k , k = 1 , . . . , K . The k -th Gaussian component has mean µ k , variance σ k and its density will be represented by f k . In addition,let Z i = ( Z i, , . . . , Z i,K ) (cid:48) be a vector such that Z i,k = 1 indicates that X i belongs to the k -th mixturecomponent, 0 otherwise. The following model is proposed:( X i | Z i,k = 1) ∼ N( µ k , σ k ) , i = 1 , . . . , n, all independent; (1)( Z | q ) ∼ Mult(1 , q ); (2)( Z i | Z i − ,k = 1 , ρ i , q , Q ) ∼ ρ i Mult(1 , q ) + (1 − ρ i ) Mult(1 , q k ) , i = 2 , . . . , n. (3)The parameters indexing the mixture components are suppressed from the conditional notation above. Theterm “Mult” refers to the multinomial distribution. The distribution in (3) implies that a Markov dependencebetween the expressions in the locations i − i exists with probability ρ i .As opposed to an ordinary mixture model, the X i variables are not marginally (w.r.t. Z ) independent in(1)-(3). The latent level (3) deﬁnes the dependence through a Markov structure. For computational reasons,the Bernoulli random variables W i , i = 2 , . . . , n , are introduced in the model as indicators of “presence of6arkov dependence”. In this case, consider for i = 2 , . . . , n :( Z i | Z i − ,k = 1 , W i = 0 , q ) ∼ Mult(1 , q ); (4)( Z i | Z i − ,k = 1 , W i = 1 , Q ) ∼ Mult(1 , q k ); (5)( W i | ρ i ) ∼ Ber( ρ i ) . (6)Assume that W = 0 almost surely and ( Z | Z , W = 0 , q ) := ( Z | q ).The full speciﬁcation of the Bayesian model is completed with the prior distributions: q ∼ Dir( r ); q k ∼ Dir( r k ); all independent;( µ k , σ k ) ∼ NIG( m k , v k , s k , s k ) , µ < µ < . . . < µ K ; (7)( ρ i | β ) = Φ( β + β d i ); β = ( β , β ) (cid:48) ∼ N ( µ , Σ ) . In the speciﬁcations above, Φ( . ) is the c.d.f. of the standard normal distribution and d i is the (transformed)distance between locations i − i . The terms Dir, NIG and N refer to the Dirichlet, normal-inverse-gamma and bivariate normal distributions, respectively. Additional notation: Z = { Z i , i = 1 , . . . , n } , W = { W i , i = 1 , . . . , n } , µ = { µ k , k = 1 , . . . , K } and σ = { σ k , k = 1 , . . . , K } .Note that the distance d i is used to stochastically explain the Markov dependence through the probitregression on W i . Vector q could be seen as the stationary distribution of the Markov chain with transitionmatrix Q ; however, this restriction is not imposed in our model, since it would complicate the inferencemethodology.The lower ( k = 1) and upper ( k = K ) Gaussian components of the mixture are expected to accommodatethe expressions of the genes more likely to form the under and overexpressed regions of interest, respectively.Since the percentages of genes in those regions are expected to be small, one can elicit informative priordistributions to insert this idea; for example, priors suggesting small values for { q , , q , , . . . , q K, } and { q ,K , q ,K , . . . , q K,K } . In this section, we present an MCMC algorithm to sample from the high dimensional and complex jointposterior distribution of all unknown quantities deﬁned in the proposed model. The implementation wasdeveloped using the R programming language (R Core Team, 2017). Unlike the MCMC algorithm fromMayrink and Gon¸calves (2017), no Metropolis-Hastings steps are required in our Gibbs sampler due to theuse of only Gaussian mixture components. The algorithm is a Gibbs sampling with blocking and sampling schemes chosen in a way to favor fastconvergence of the chain. In this direction, consider a set of independent auxiliary variables V := { V i , i =2 , . . . , n } to allow for direct sampling from the full conditional distribution of β (see Albert (1992)). Let (cid:126)d i = (1 , d i ) (cid:48) and deﬁne: ( V i | β ) ∼ N ( β (cid:48) (cid:126)d i ,

1) and W i = (cid:26) , if V i > , if V i ≤ . The original model for X is preserved here; this can be veriﬁed by simply integrating V out.Let ψ := { µ, σ } and consider the following blocking scheme for the Gibbs sampler: ( Z, W ), V ,( q , Q, β, ψ ). In order to deﬁne an irreducible Markov chain, we consider a collapsed Gibbs sampling (seeLiu (1994)) with V being integrated out from the full conditional distribution of ( Z, W ).All full conditional densities are proportional to the joint density of X and the unknown quantities inthe model - this density is given in Appendix A (A.1). The next items describe each step of the algorithm.7 ampling V : The V i ’s are all independent with distribution N( β (cid:48) (cid:126)d i , W i = 1 and non-positive if W i = 0. Sampling ( q , Q, β, ψ ) : The four elements in this block - q , Q , β and ψ , are conditionally independent; therefore, each of themcan be sampled individually from the corresponding marginal full conditional distribution. The marginalsof the ﬁrst three components are given in Appendix A (A.2)-(A.4). The marginal full conditional of ψ is atruncation of the tractable distribution shown in Appendix A (A.5); the truncation region is deﬁned by therestriction { µ < µ < . . . < µ K } . Using the rejection sampling, one can sample exactly from this truncateddistribution by proposing from its non-truncated version and accepting if the restriction is preserved. Since K is chosen to be small, the algorithm is computationally eﬃcient. Sampling ( Z, W ) : This step of the algorithm was developed in Mayrink and Gon¸calves (2017) and consists of a backward-ﬁltering-forward-sampling (BFFS) scheme. Note that integrating out V provides the following full conditionalkernel: π ( Z, W |· ) ∝ n (cid:89) i =1 (cid:34) K (cid:89) k =1 (cid:2) ( f k ( X i | ψ ) q k ) Z ik (cid:3) − W i (cid:2) ( f k ( X i | ψ ) q k ( i − k ) Z ik (cid:3) W i (cid:35) × (cid:0) Φ + i (cid:1) W i (cid:0) Φ − i (cid:1) − W i K (cid:89) k =1 q Z k k , (8)with Φ + i = Φ( β (cid:48) (cid:126)d i ) and Φ − i = Φ( − β (cid:48) (cid:126)d i ).Direct sampling from (8) is possible if we assume the factorisation: π ( Z, W |· ) ∝ π ( Z |· ) n (cid:89) i =2 π ( Z i | W i , Z i − , · ) π ( W i | Z i − , · ) . This implies that (

Z, W ) must be sampled in the forward direction, i.e., according to the order: Z , W , Z , . . . , W n , Z n . The marginal distributions are multinomial (Bernoulli for W i ) and their parameter values areobtained recursively (backwards from n to 1). Further details about the BFFS are given in Appendix A. As explained in Section 3, the genes forming a cluster of down and up-regulation are supposed to be in-corporated by the lower and upper Gaussian components of the mixture, respectively. Hence, the clusterdetection ought to be done based on the posterior probability that one or more genes are accommodated bythe (lower or upper) Gaussian component.Given a sample of size M from the posterior distribution of Z and a sequence { i , i , . . . , i S ; S = 1 , , . . . } of expressions, the posterior probability of this sequence being a down-regulated or an up-regulated clusterof genes are, respectively:1 M M (cid:88) m =1 S (cid:89) s =1 ( Z ( m ) i s , = 1) and 1 M M (cid:88) m =1 S (cid:89) s =1 ( Z ( m ) i s ,K = 1); (9)where ( . ) is the indicator function.In another possible criterion for cluster detection, one may ﬁnd all genes with probability greater than athreshold (for example 0 .

5) of belonging to a (lower or upper) Gaussian component; then, search which ofthem form contiguous sequences of a minimum size (say 4 or 5).

The analysis reported in this section is related to four RNA-Seq data sets representing diﬀerent typesof cancers; see details in Section 2. Since the scale of the preprocessed expressions is the same for all8hromosomes, the analysis is developed by organising the chromosomes 1 −

22 and X side by side (for eachdata set) to form the genomic sequence. Note that there exists no distance to be measured between the lastand ﬁrst locations of consecutive chromosomes, this means that we have no reason to assume a dependencebetween the gene expressions in these extremities. This aspect is incorporated in the model by ﬁxing W i = 0for the ﬁrst gene location in each chromosome. This assumption is computationally convenient since theBFFS procedure can be executed separately (therefore in parallel) for each chromosome due to conditionalindependence.Choosing the number of Gaussian components in the mixture is certainly a critical step in our analysis.As discussed in Section 3, K is at least 3; however, assuming K small leads to a model ﬁt where thelower/upper Gaussian components are too wide (with high variance) and include too many expressions. Inorder to explore other reasonable choices, the models with K = 3, 4, 5, 6, 7, 8 and 9 are ﬁtted to the Breastand Ovarian data sets. Results are presented in Appendix B and they indicate that K = 8 is a suitableglobal choice showing the same behaviour between data sets and with respect to K = 9.In terms of prior speciﬁcation, we set r = (32 . , . , . , . , . , . , . , . (cid:48) , whichindicates that an expression at a location without the Markov dependence most likely belongs to an internal(not the lower or upper) Gaussian component in the mixture; note that r = r = r k /

10 for any k ∈{ , . . . , } . The sum of these weights is 2 ,

000 and smaller concentration parameters would determine weaklyinformative priors w.r.t. the data involving thousands of genes. The following matrix shows the priorspeciﬁcation for q k . r =  .

12 606 .

06 121 .

21 12 .

12 12 .

12 12 . .

12 1212 .

12 606 .

06 121 .

21 12 .

12 12 .

12 12 . .

35 467 .

29 934 .

58 467 .

29 93 .

46 9 .

35 9 .

35 9 . .

97 89 .

69 448 .

43 896 .

86 448 .

43 89 .

69 8 .

97 8 . .

97 8 .

97 89 .

69 448 .

43 896 .

86 448 .

43 89 .

69 8 . .

35 9 .

35 93 .

46 467 .

29 934 .

58 467 .

29 9 . .

12 12 .

12 121 .

21 606 .

06 1212 .

12 12 . .

12 12 .

12 121 .

21 606 .

06 1212 .  . The k -th row of r contains the parameter vector for the Dirichlet prior of q k . Note that the sum of any rowis 2 ,

000 and the largest weights are deﬁned in the main diagonal (the r kk entries). This conﬁguration favorsthe allocation of the i -th observation in the same Gaussian component of the neighbour expression fromlocation i −

1. In addition, the weight in each row decreases when moving away from the main diagonal; inany row, the second and third largest weights are given by r kk / r kk /

10, respectively. This speciﬁcationdiscourages the model to assign two consecutive observations in distant components. Small weights arealso speciﬁed for the lower/upper Gaussian components in columns 1 and 8 (except for the main diagonalelements); we have r k = r k = r kk / k = 2 , . . . ,

7. This choice indicates that the lower/uppercomponents are accessible only to those genes with great evidence of diﬀerential expression and havingneighbours with the same characteristic. We emphasise that r and r can be easily adapted to other valuesof K , such as those explored in Appendix B. All adaptations have the same discussed features: arrangementof small/large values, magnitude of weights (largest one divided by 2, 10 or 100) and the sum of weightsbeing 2 , β , β ) control the impact of the distance d i over the probability of Markov dependence ρ i . The analysis in Mayrink and Gon¸calves (2017) estimates those parameters to be around (5 . − . d i and ρ i . Therefore, they ought to be used to elicit an informative prior. In line withthat, we set ( β , β ) (cid:48) ∼ N [(5 . , − . (cid:48) , . I ], which indicates: lim d i → ρ i ≈

1, lim d i → ρ i ≈ ρ i = 0 . d i ≈ . µ k , σ k ), we choose the mean m k based on the scale of the ob-servations and the role expected for the Gaussian components in the model. In this case, let v k = 10 forall k and m k = − − . − . − .

43, 1 .

43, 4 .

29, 7 .

14 and 10 for k = 1 , . . . ,

8, respectively. We alsoset s k = 2 . s k = 1 . k , meaning that E ( σ k ) = 1 and Var( σ k ) = 10. A sensitivity analysiswas performed by increasing the prior variance of ( µ k , σ k ) and the very same results were obtained – seeAppendix B. 9or each data set, the MCMC chain runs for 10 ,

000 iterations with a burn-in of 2 , q (0)0 = 0 . (8 × and Q (0) = 0 . (8 × ; where ( l × l ) is a l × l matrix of ones. Inaddition, β (0)0 = 5 .

05 and β (0)1 = − .

55. The starting values of µ k and σ k are set based on descriptivestatistics exploring the data. The support of the histogram in Figure 1 (a) is broken into 8 contiguousintervals and their means and variances are used to initialise the MCMC; Table 2 shows the starting valuesfor each data set. Convergence is rapidly attained – Appendix C shows some diagnostics.Table 2: Initial values in the MCMC for µ k and σ k ( K = 8 mixture components). k µ k -13.85 -10.64 -7.43 -4.23 -1.02 2.19 5.40 8.60 σ k µ k -13.40 -10.11 -6.83 -3.54 -0.26 3.03 6.31 9.59 σ k µ k -14.12 -10.82 -7.52 -4.22 -0.93 2.37 5.67 8.97 σ k µ k -12.88 -9.77 -6.65 -3.54 -0.42 2.69 5.81 8.92 σ k Table 3: Posterior mean and standard deviation (in parentheses) for the parameters in the mixture.

Breast Lung Ovarian Uterus µ -11.168 (0.162) -11.019 (0.141) -11.157 (0.130) -10.426 (0.138) σ µ -7.866 (0.100) -7.707 (0.096) -7.873 (0.088) -7.244 (0.100) σ µ -4.765 (0.080) -4.569 (0.098) -5.130 (0.089) -3.974 (0.080) σ µ -3.534 (0.314) -3.375 (0.271) -3.664 (0.075) -3.652 (0.193) σ µ -2.441 (0.150) -2.293 (0.141) -1.853 (0.049) -2.152 (0.122) σ µ -0.167 (0.180) -0.049 (0.111) -0.058 (0.082) -0.064 (0.191) σ µ σ µ σ Table 4: Posterior weights of the mixture components. k Table 3 presents the posterior estimates of ( µ k , σ k ) for all mixture components. Note that the values ofthe mean and standard deviation are similar across the diﬀerent data sets, implying similar mixture densities.Figure 4 conﬁrms this visual interpretation by showing the data histograms overlaid by the estimated mixturedensity and their components. Note that the range of the lower/upper Gaussians are not too wide; therefore,they accommodate well the expressions in the left/right tails of the histograms. According to this result,10he mixture with K = 8 components seems a good representation of the data. Table 4 shows the posteriorweight of each k in each cancer; this value is the posterior mean of (cid:80) ni =1 Z ik /n .Figure 4: Histogram of all expressions overlaid by the estimated mixture density (dashed curve) and itscomponents (grey and black curves).Figure 5: Expression values along chromosome 1. The triangles indicate observations where the Gaussiancomponent has posterior probability above 0 . > .

5) of belonging to thelower/upper Gaussian components, are concentrated on the bottom and top of the graphs. For chromosome1, the lower Gaussian component clearly accommodates more expressions than the upper case. Note thatinternal Gaussian observations (grey dots) can also be identiﬁed in the bottom/top level. These expressionsare allocated in adjacent Gaussian components, which is a consequence of the model structure accountingfor distances between genes to establish the strength of the Markov dependence.Some interesting similarities among the data sets can be observed in Figure 5; there are regions inchromosome 1 with the same under or overexpressed status in two data sets. In the analysis of the wholegenomic sequence (45 ,

543 locations in 23 chromosomes), the number of genes allocated with high probabilityin the lower and upper Gaussian components are, respectively: 722 and 31 (Breast), 717 and 54 (Lung), 933and 74 (Ovarian) and 877 and 46 (Uterus). Table 5 cross-compares data sets and indicates the proportionof locations identiﬁed in the same Gaussian component (lower or upper) in both cancers, relative to thenumber of locations identiﬁed in that Gaussian component in at least one of the two cancers; this analysis11s restricted to those genes represented in all four data sets. Note that the largest percentages are obtainedin the Lung/Uterus comparison for both down and up-regulation status. Most percentages are 0 in theupper Gaussian case, since the number of overexpressed genes identiﬁed by the model is small. On the otherhand, the smallest percentage found in the lower Gaussian case is 1 .

40% representing the Breast/Ovariancomparison.Figure 6 presents grey scale heat maps to investigate the posterior probabilities of down and up-regulationin panels (a, c) and (b, d), respectively. Results for all 23 chromosomes are shown in panels (a, b) and onlychromosome 1 is explored in panels (c, d). Panel (d) shows that the Lung and Uterus cancers tend to providecoherent results in chromosome 1.Table 5: Cross-comparison of data sets. Number of locations identiﬁed in the same Gaussian component(lower or upper) in both data sets, relative to the number of locations identiﬁed in that Gaussian componentin at least one of the data sets. The results are presented as percentages. lower Gaussian upper GaussianBreast Lung Ovarian Uterus Breast Lung Ovarian UterusBreast 100 1.76 1.40 1.77 100 0 0 0Lung 100 1.72 3.50 100 0 1.45Ovarian 100 2.00 100 0Uterus 100 100(a) (b)(c) (d)

Figure 6: Heat map image indicating, for each location, the posterior probability of belonging to a Gaussiancomponent. Conﬁguration of panels: lower Gaussian in column 1 (a, c), upper Gaussian in column 2 (b, d),all chromosomes in row 1 (a, b) and chromosome 1 in row 2 (c, d).If we consider the strategy discussed in Section 4.2 for cluster detection, i.e. ﬁnding consecutive genes withhigh probability ( > .

5) of down or up-regulation status, the following range of ﬁve locations are examplesof under and overexpressed clusters, respectively: 6 , − ,

242 and 2 , − ,

359 (Breast), 4 , − ,

359 and39 , − ,

359 (Lung), 10 , − ,

683 and 2 , − ,

703 (Ovarian) and 4 , − ,

293 and 37 , − , .

5. The onlyexception is the conﬁguration “isolated overexpressed gene in the breast cancer data”; however, the smallbars here indicate very low frequency (breast data has the smallest genomic sequence) and they are clearlyclose to 0 .

5. The overall behaviour exhibited in these graphs suggests that although the magnitude of anisolated expression is compatible with a down/up-regulated cluster, the model cannot use the neighboursto change this status, because they are too far away. These genes might belong to potential clusters, but aconﬁrmation of this relies on their non-observed neighbours.Figure 7: Histogram of the distances between the location of a cluster, with a single observation, and itsnearest internal Gaussian neighbours; down-regulation in row 1 and up-regulation in row 2.Figure 8: Estimated relationship between the distances and ρ i (probability of having a Markov dependence).Result corresponding to the Breast data; the posterior means 5 .

108 ( β ) and − .

513 ( β ) were used to drawthe curve.Figure 8 presents the estimated relationship between d i and the probability ρ i of a Markov dependenceat location i . The posterior means of ( β , β ) for each data set are, respectively: 5 .

108 and − .

513 (Breast),5 .

115 and − .

509 (Lung), 5 .

149 and − .

486 (Ovarian) and 5 .

121 and − .

504 (Uterus). These are all similar,which is expected since we have assumed an informative prior for these coeﬃcients. The decreasing rateof the curve is an important feature to explain the impact of d i on ρ i ; for example, the slow decay beforedistance 0 . d i < .

4. In addition, the curve has a fastdecay between distances 0 . .

6, and it is almost 0 for d i > .

8. This decreasing rate resembles the oneobserved in Mayrink and Gon¸calves (2017) due to our informative prior speciﬁcation.13

Conclusions

This paper proposes a statistical methodology to identify chromosomal regions associated with low/high geneexpression measurements, which we call under/overexpressed or down/up-regulated regions, respectively,based on RNA-Seq data. Given the alignment of read counts to gene locations in the human chromosomes,the data is conﬁgured as an irregularly spaced sequence along the genome. Four data sets, representingbreast, lung, ovarian and uterus cancer, are considered in the study; preprocessing and normalising stepswere applied to the raw read counts before the statistical analyses. The proposed methodology is basedon the work from Mayrink and Gon¸calves (2017) with two main diﬀerences: RNA-Seq data is considered(instead of microarray) and both under and overexpressed chromosome regions are identiﬁed (not only the“over”).The proposed model is based on a mixture of Gaussians, such that the mixture components are supposedto classify the observations in three main categories; the lower Gaussian (smallest mean) accommodates thedown-regulated expressions, the upper Gaussian (largest mean) incorporates the up-regulated expressions andall intermediate Gaussian components are related to the non-diﬀerentially expressed genes. The model takesadvantage of the distances between gene locations to explain the Markov dependence between neighbours.This dependence is used to stochastically deﬁne the clusters of interest. The Bayesian inference is performedthrough an MCMC algorithm consisting of a Gibbs sampling.The main results indicate that the model is able to recognise locations belonging to under and over-expressed regions. This identiﬁcation is not arbitrary since the whole model structure, including priorspeciﬁcations, has contributed to a selective judgment and clear discernment of the gene statuses throughposterior probabilities. On average, only 2 .

02% and 0 .

12% of the genes reached a probability higher then 0 . Acknowledgements

The authors would like to thank Funda¸c˜ao de Amparo `a Pesquisa do Estado de Minas Gerais (FAPEMIG)for supporting this research.

Appendix A: Joint and full conditional distributions

The joint density of X and all the unknown components in the model is: π ( X, Z, W, V, q , Q, ψ, β ) = (cid:34) n (cid:89) i =1 π ( X i | Z i , ψ ) π ( Z i | Z i − , W i , q , Q ) π ( W i | V i ) π ( V i | β ) (cid:35) π ( q , Q, ψ, β ) , = (cid:34) n (cid:89) i =1 (cid:34) K (cid:89) k =1 f k ( X i | ψ ) Z i,k ( q Z i,k k ) − W i ( q Z i,k k ( i − k ) W i (cid:35) × [ ( W i = 1) ( V i >

0) + ( W i = 0) ( V i ≤ × φ ( V i − β (cid:48) (cid:126)d i ) (cid:35) (cid:34) K (cid:89) k =1 q Z ,k k q r k − k (cid:35)  K (cid:89) k =1 K (cid:89) k =1 q r k k − k k  π ( ψ ) π ( β ) , (A.1) where k ( i − = j if Z i − ,j = 1, and π ( β ) = | Σ | − / φ [Σ − / ( β − µ )]. Notation: φ ( . ) and φ ( . ) are thedensities of the uni- and bi-dimensional standard normal distributions, respectively. π ( ψ ) = K (cid:89) k =1 π NIG( µ k , σ k ; m k , v k , s k , s k ) ( µ < . . . < µ K ) . q , Q, ψ, β ) is:( q |· ) ∼ Dir (cid:34) r + n (cid:88) i =1 Z i (1 − W i ) (cid:35) , (A.2)( q k |· ) ∼ Dir (cid:34) r k + n (cid:88) i =2 ( Z i − ,k W i ) Z i (cid:35) , (A.3)( β |· ) ∼ N ( µ ∗ , Σ ∗ ) , (A.4)with Σ ∗ = (cid:16) Σ − + (cid:80) ni =1 (cid:126)d i (cid:126)d i (cid:48) (cid:17) − and µ ∗ = Σ ∗ (cid:16) Σ − µ + (cid:80) ni =1 V i (cid:126)d i (cid:17) .( µ k |· ) ∼ N( m ∗ k , v ∗ k ) and ( σ k |· ) ∼ IG( s ∗ k , s ∗ k ) , for k = 1 , . . . , K, (A.5)where v ∗ k = v k σ k v k (cid:80) ni =1 Z i,k , m ∗ k = v ∗ k (cid:16) m k + v k (cid:80) ni =1 Z i,k X i vσ k (cid:17) , s ∗ k = s k + ( (cid:80) ni =1 Z i,k ) and s ∗ k = s k + (cid:20) m k v k + (cid:80) ni =1 Z i,k X i − v k v k (cid:80) ni =1 Z i,k (cid:16) m k v k + (cid:80) ni =1 Z i,k X i (cid:17) (cid:21) .The sampling step of ( Z, W ) is Z ∼ Mult(1 , q ∗ ) , ( W i | Z i − ,j = 1 , · ) ∼ Ber(1 , p ∗ ( i,j ) ) , i = 2 , . . . , n, ( Z i | W i = l, Z i − ,j = 1 , · ) ∼ Mult(1 , q ∗ ( i,j,l ) ) , i = 2 , . . . , n, where q ∗ k = c ,k q k /a , p ∗ ( i,j ) = b i,j Φ + i /c i,j , (A.6) q ∗ ( i,j, k = c i +1 ,k f k ( X i | ψ ) q k /a i , q ∗ ( i,j, k = c i +1 ,k f k ( X i | ψ ) q j,k /b i,j . Consider c n +1 ,j = 1, ∀ j , and: a n = K (cid:88) k =1 f k ( X n | ψ ) q k , b n,j = K (cid:88) k =1 f k ( X n | ψ ) q j,k , c n,j = b n,j Φ + n + a n Φ − n , (A.7) a i = K (cid:88) k =1 c i +1 ,k f k ( X i | ψ ) q k , b i,j = K (cid:88) k =1 c i +1 ,k f k ( X i | ψ ) q j,k , c i,j = b i,j Φ + i + a i Φ − i , for i = n − , . . . , ,a = K (cid:88) k =1 c ,k q k . First, the calculations are performed recursively, starting from n and moving backwards in the ﬁltering partgiven in (A.7); here, b i and c i are K -vectors and a i is a scalar. Next, the probabilities shown in (A.6) arecalculated and ( Z, W ) are sampled.

Appendix B: Sensitivity analysis

Figure B.1 and Table B.1 report the sensitivity of the results to the choice of K (number of Gaussiancomponents). Here, the model is ﬁtted assuming seven conﬁgurations: K = 3, 4, 5, 6, 7, 8 and 9. TheDirichlet priors for q and Q are speciﬁed so that the degree of information in the prior (sum of the hyperpa-rameters of the Dirichlet) is the same across all conﬁgurations; see the description in Section 5. As expected,the results for K = 3 indicates lower/upper Gaussian components with high variance and accommodatingtoo many expressions. The weights of the lower/upper Gaussian components decreases as K increases. Interms of variance, the two smallest values are observed for large K in both cases. Table B.1 also shows that15he variance of the mixture involving only the internal normals is increasing with K . Note that the resultsbecome quite similar for K = 7 and 8 (Breast) and K = 8 and 9 (Ovarian). Recall that the Ovarian dataset is larger (6,541 more aligned genes) than the Breast data; this might explain the diﬀerences observedbetween them. In conclusion, it seems reasonable to choose the model with K = 8 (results shown in Section5).Figure B.1: Histogram of all expressions (Breast in row 1, Ovarian in row 2) overlaid by the estimatedmixture density (dashed curve) and its Gaussian components (grey and black curves) for seven diﬀerentchoices of K .Table B.1: Posterior estimates (for each size K , Ovarian data) of the weight, µ k and σ k related to the lowerand upper Gaussian components and the mean (expec.) and variance (var.) of the (normalised) mixture ofinternal normals (without the lower and upper components); standard errors in parentheses. K weight 1 µ σ weight K µ K σ K expec. var.3 0.180 (0.007) -8.331 (0.088) 4.705 (0.168) 0.101 (0.005) 1.760 (0.093) 5.724 (0.209) -2.914 (0.041) 6.008 (0.130)4 0.128 (0.008) -9.130 (0.114) 3.462 (0.159) 0.066 (0.005) 2.551 (0.124) 5.318 (0.268) -2.995 (0.058) 6.851 (0.157)5 0.069 (0.006) -10.020 (0.110) 2.704 (0.137) 0.041 (0.004) 3.022 (0.188) 5.877 (0.432) -3.197 (0.048) 8.765 (0.187)6 0.037 (0.004) -10.763 (0.124) 2.201 (0.157) 0.029 (0.003) 3.296 (0.233) 6.483 (0.576) -3.327 (0.031) 10.071 (0.167)7 0.033 (0.004) -10.941 (0.134) 2.057 (0.163) 0.018 (0.003) 3.283 (0.366) 7.803 (1.085) -3.276 (0.032) 10.638 (0.179)8 0.028 (0.003) -11.157 (0.130) 1.914 (0.158) 0.014 (0.002) 2.802 (0.197) 1.506 (0.339) -3.280 (0.027) 11.174 (0.152)9 0.026 (0.003) -11.228 (0.142) 1.855 (0.168) 0.012 (0.002) 2.913 (0.258) 1.482 (0.391) -3.276 (0.027) 11.265 (0.155) The next results are related to a prior sensitive analysis developed for the parameters of the Gaussiancomponents in the mixture with K = 8. Two diﬀerent speciﬁcations are studied here, the ﬁrst one hasthe same conﬁguration explored in Section 5. Consider ( µ k , σ k ) ∼ NIG( m k , , . , .

1) with m k = − − . − . − .

43, 1 .

43, 4 .

29, 7 .

14 and 10, for k = 1 , . . . ,

8, respectively. The second speciﬁcation assumeshigher variability by doubling the original standard deviation, i.e., ( µ k , σ k ) ∼ NIG( m k , , . , . m k and E ( σ k ) are not changed with respect to the ﬁrst option. Figure B.2 and Table B.2show the results; they are visually the same and this indicates robustness to the prior uncertainty of theseparameters. 16igure B.2: Comparison of prior speciﬁcations (Breast data). Histogram of all expressions overlaid by theestimated mixture density (dashed curve) and its components (grey and black curves).Table B.2: Comparison of prior speciﬁcations (Breast data). Posterior estimates of: the weights, µ k , σ k , for k = 1 and 8 (lower and upper Gaussian components), mean (expec.) and variance (var.) of the normalisedmixture of internal normals; standard errors in parentheses. Prior weight 1 µ σ weight 8 µ σ expec. var.Original 0.026 (0.003) -11.168 (0.162) 2.125 (0.200) 0.014 (0.002) 2.666 (0.312) 1.436 (0.350) -3.247 (0.029) 11.666 (0.158)Vague 0.026 (0.003) -11.170 (0.155) 2.111 (0.190) 0.014 (0.002) 2.657 (0.286) 1.312 (0.351) -3.244 (0.028) 11.653 (0.153) Appendix C: MCMC diagnostics

This section presents some results to show that the MCMC designed for our gene expression analysishas good convergence properties. Two aspects explaining this behaviour are: the chosen blocking schemefor the algorithm and the fact that all parameters, including (

Z, W ), can be directly sampled from theirfull conditional distributions. The graphs in Figures C.1 and C.2 explore some MCMC chains for the block(

Z, W ). Here, the trace plots indicate good mixing properties and the trajectories of the ergodic averagessuggest fast convergence.Figure C.1: Trace plots of: Z i and W i (from location i = 1 , Z i and W i (from location i = 24 , . Z i , W i ) and three pairs of ( Z i , W i ) alongthe MCMC chain (Breast data). References

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling.

Journal ofEducational and Behavioral Statistics 17 , 251–269.Amaratunga, D., J. Cabrera, and Z. Shkedy (2014).

Exploration and analysis of DNA microarrays and other high-dimensional data (2 ed. ed.). Hoboken: Wiley.Anders, S. and W. Huber (2010). Diﬀerential expression analysis for sequencing count data.

Genome Biology 11 .R106.Berger, M. F., J. Z. Levin, K. Vijayendran, A. Sivachenko, X. A. J. Maguire, L. A. Johnson, J. Robinson, R. G.Verhaak, C. Sougnez, R. C. Onofrio, L. Ziaugra, K. Cibulskis, E. Laine, J. Barretina, W. Winckler, D. E. Fisher,G. Getz, M. Meyerson, D. B. Jaﬀe, S. B. Gabriel, E. S. Lander, R. Dummer, A. Gnirke, C. Nusbaum, and L. A.Garraway (2010). Integrative analysis of the melanoma transcriptome.

Genome Research 20 , 413–427.Bivand, R. and G. Piras (2015). Comparing implementations of estimation methods for spatial econometrics.

Journalof Statistical Software 63 (18), 1–36.Bullard, J. H., E. Purdom, K. D. Hansen, and S. Dudoit (2010). Evaluation of statistical methods for normalizationand diﬀerential expression in mRNA-Seq experiments.

BMC Bioinformatics 11 (94).Chu, Y. and D. R. Corey (2012). RNA sequencing: platform selection, experimental design and data interpretation.

Nucleic Acid Therapeutics 22 (4), 271–274.Conesa, A., P. Madrigal, S. Tarazona, D. Gomez-Cabrero, A. Cervera, A. McPherson, M. W. Szczesniak, D. Gaﬀney,L. L. Elo, X. Zhang, and A. Mortazavi (2016). A survey of best practices for RNA-Seq data analysis.

GenomeBiology 17 (13).Dillies, M. A., A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime, G. Marot, D. Castel,J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloe, C. Le-Gall, B. Schaeﬀer, S. Le-Crom, M. Guedj, andF. Jaﬀrezic (2012). A comprehensive evaluation of normalization methods for Illumina high-throughput RNAsequencing data analysis.

Brieﬁngs in Bioinformatics 14 (6), 671–683.Faulkner, G. J., A. R. Forrest, A. M. Chalk, K. Schroder, Y. Hayashizaki, P. Carninci, D. A. Hume, and S. M.Grimmond (2008). A rescue strategy for multimapping short sequence tags reﬁnes surveys of transcriptionalactivity by cage.

Genomics 91 (3), 281–288.Fonseca, N. A., J. Rung, A. Brazma, and J. C. Marioni (2012). Tools for mapping high-throughput sequencing data.

Bioinformatics 28 (24), 3169–3177.Frazee, A. C., S. Sabunciyan, K. D. Hansen, R. A. Irizarry, and J. T. Leek (2014). Diﬀerential expression analysis ofDNA-Seq data at single-base resolution.

Biostatistics 15 (3), 413–426.Garber, M., M. G. Grabherr, M. Guttman, and C. Trapnell (2011). Computational methods for transcriptomeannotation and quantiﬁcation.

Nature Methods 8 (6), 469–477.Gentleman, R., V. Carey, D. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry,K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. Rossini, G. Sawitzki,C. Smith, G. Smyth, L. Tierney, J. Yang, and J. Zhang (2004). Bioconductor: open software development forcomputational biology and bioinformatics.

Genome Bioloy 5 . R80. an, Y., J. Chen, X. Zhao, C. Liang, Y. Wang, L. Sun, Z. Jiang, Z. Zhang, R. Yang, J. Chen, Z. Li, A. Tang, Z. Li,J. Ye, Z. Guan, Y. Gui, and Z. Cai (2011). MicroRNA expression signatures of bladder cancer revealed by deepsequencing. PLoS One 6 (3). e18286.Hansen, K. D., R. A. Irizarry, and Z. Wu (2012). Removing technical variability in RNA-seq data using conditionalquantile normalization.

Biostatistics 41 (2), 204–216.Horvath, S. and M. P. Baur (2000). Future directions of research in statistical genetics.

Statistics in Medicine 19 ,3337–3343.Irizarry, R. A., B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed (2003).Exploration, normalization, and summaries of high density oligonucleotide array probe level data.

Biostatistics 4 ,249–264.Law, C. W., Y. Chen, W. Shi, and G. K. Smyth (2014). voom: precision weights unlock linear model analysis toolsfor RNA-Seq read counts.

Genome Biology 15 (2). R29.Li, B., V. Ruotti, R. M. Stewart, J. A. Thomson, and C. N. Dewey (2010). RNA-Seq gene expression estimationwith read mapping uncertainty.

Bioinformatics 26 (4), 493–500.Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulationproblem.

Journal of the American Statistical Association 89 , 958–966.Lucas, J. E., H. N. Kung, and J. T. A. Chi (2010). Latent factor analysis to discover pathway associated putativesegmental aneuploidies in human cancers.

PLoS Computational Biology 6 . e1000920.Maher, C. A., C. Kumar-Sinha, X. Cao, S. Kalyana-Sundaram, B. Han, X. Jing, L. Sam, T. Barrette, N. Palanisamy,and A. M. Chinnaiyan (2009). Transcriptome sequencing to detect gene fusions in cancer.

Nature 458 (7234),97–101.Marioni, J. C., C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad (2008). RNA-Seq: an assessment of technical-reproducibility and comparison with gene expression arrays.

Genome Research 18 , 1509–1517.Mayrink, V. D. and F. B. Gon¸calves (2017). A Bayesian hidden Markov mixture model to detect overexpressedchromosome regions.

Journal of the Royal Statistical Society, Series C 66 (2), 387–412.Mayrink, V. D. and J. E. Lucas (2015). Bayesian factor models for the detection of coherent patterns in geneexpression data.

Brazilian Journal of Probability and Statistics 29 (1), 1–33.McCarthy, D. J., Y. Chen, and G. K. Smyth (2012). Diﬀerential expression analysis of multifactor RNA-Seq experi-ments with respect to biological variation.

Nucleic Acids Research 40 , 4288–4297.Moran, P. A. P. (1950). Notes on continuous stochastic phenomena.

Biometrika 37 (1), 17–23.Mortazavi, A., B. A. Williams, K. McCue, L. Schaeﬀer, and B. Wold (2008). Mapping and quantifying mammaliantranscriptome by RNA-Seq.

Nature Methods 5 (7), 621–628.Nookaew, I., M. Papini, N. Pornputtapong, G. Scalcinati, L. Fagerberg, M. Uhlen, and J. Nielsen (2012). A com-prehensive comparison of RNA-Seq-based transcriptome analysis from reads to diﬀerential gene expression andcross-comparison with microarrays: a case study in Saccharomyces cerevisiae.

Nucleic Acids Research 40 (20),10084–10097.Nueda, M. J., S. Tarazona, and A. Conesa (2014). Next maSigPro: updating maSigPro bioconductor package forRNA-Seq time series.

Bioinformatics 30 (18), 2598–2602.Oshlack, A., M. D. Robinson, and M. D. Young (2010). From RNA-Seq reads to diﬀerential expression results.

Genome Biology 11 (220).Oshlack, A. and M. J. Wakeﬁeld (2010). Transcript length bias in rna-seq data confounds systems biology.

BiologyDirect 11 (220).Papastamoulis, P. and M. Rattray (2017). A Bayesian model selection approach for identifying diﬀerentially expressedtranscripts from RNA sequencing data.

Journal of the Royal Statistical Society, Series C , n/a–n/a.Pollack, J. R., T. Sorlie, C. M. Perou, C. A. Rees, S. S. Jeﬀrey, P. E. Lonning, R. Tibshirani, D. Botstein, A. L. B.Dale, and P. O. Brown (2002). Microarray analysis reveals a major direct role of DNA copy number alterationin the transcriptional program of human breast tumors.

Proc. of the National Academy of Sciences of the UnitedStates of America 99 , 12963–12968.R Core Team (2017).

R: A language and environment for statistical computing

Bioinformatics 26 , 139–140.Robinson, M. D. and A. Oshlack (2010). A scaling normalization method for diﬀerential expression analysis ofRNA-Seq data.

Genome Biology 11 . R25.Soneson, C. and M. Delorenzi (2013). A comparison of methods for diﬀerential expression analysis of RNA-Seq data.

BMC Bioinformatics 14 (91).Trapnell, C., D. G. Hendrickson, M. Sauvageau, L. Goﬀ, J. L. Rinn, and L. Pachter (2013). Diﬀerential analysis ofgene regulation at transcript resolution with RNA-Seq.

Nature Biotechnology 31 (1), 46–53. an-De-Wiel, M. A., G. G. R. Leday, L. Pardo, H. Rue, A. W. Van-Der-Vaart, and W. N. Van-Wieringen (2013).Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics 14 (1), 113–128.Wang, Z., M. Gerstein, and M. Snyder (2009). RNA-Seq: a revolutionary tool for transcriptomics.

Nature ReviewsGenetics 10 (1), 57–63.Zhang, H., J. Xu, N. Jiang, X. Hu, and Z. Luo (2015). PLNseq: a multivariate Poisson lognormal distribution forhigh-throughput matched RNA-sequencing read count data.

Statistics in Medicine 34 , 1577–1589., 1577–1589.