[PDF] Bayesian estimation of Differential Transcript Usage from RNA-seq data

Abstract

Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU) and targets changes in the relative within gene expression of a transcript. The contribution of this paper is to: (a) extend the use of cjBitSeq to the DTU context, a previously introduced Bayesian model which is originally designed for identifying changes in overall expression levels and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is a read based model and performs fully Bayesian inference by MCMC sampling on the space of latent state of each transcript per gene. BayesDRIMSeq is a count based model and estimates the Bayes Factor of a DTU model against a null model using Laplace's approximation. The proposed models are benchmarked against the existing ones using a recent independent simulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesian methods exhibit similar performance with DRIMSeq in terms of precision/recall but offer better calibration of False Discovery Rate.

Full PDF

BBayesian estimation of Diﬀerential Transcript Usage fromRNA-seq data

Panagiotis Papastamoulis ∗ and Magnus Rattray Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine andHealth, University of Manchester, UK

Abstract

Next generation sequencing allows the identiﬁcation of genes consisting of diﬀerentiallyexpressed transcripts, a term which usually refers to changes in the overall expression level.A speciﬁc type of diﬀerential expression is diﬀerential transcript usage (DTU) and targetschanges in the relative within gene expression of a transcript. The contribution of this paperis to: (a) extend the use of cjBitSeq to the DTU context, a previously introduced Bayesianmodel which is originally designed for identifying changes in overall expression levels and (b)propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is aread based model and performs fully Bayesian inference by MCMC sampling on the space oflatent state of each transcript per gene. BayesDRIMSeq is a count based model and estimatesthe Bayes Factor of a DTU model against a null model using Laplace’s approximation. Theproposed models are benchmarked against the existing ones using a recent independentsimulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesianmethods exhibit similar performance with DRIMSeq in terms of precision/recall but oﬀerbetter calibration of False Discovery Rate.

Keywords:

MCMC, Laplace approximation, alternative splicing, within gene transcript ex-pression, false discovery rate ∗ corresponding author: [email protected] a r X i v : . [ q - b i o . GN ] S e p Introduction

High throughput sequencing of cDNA (RNA-seq) [Mortazavi et al., 2008] is an important toolto quantify transcript expression levels and to identify diﬀerences between diﬀerent biologicalconditions. RNA-seq experiments produce a large number (millions) of short reads (nucleotidesequences) which are typically mapped to the genome or transcriptome. Expression quantiﬁca-tion requires estimating the number of reads originating from each transcript in a given sample.Quantifying the transcriptome between diﬀerent samples allows the identiﬁcation of diﬀeren-tially expressed (DE) transcripts between them. However, certain diﬃculties complicate theinference procedure. In higher eukaryotes, most genes are spliced into alternative transcriptswhich share speciﬁc parts of their sequence (exons). Hence, a given short read typically aligns todiﬀerent positions of the transcriptome and statistical models are often used to infer the originprobabilistically [Trapnell et al., 2010, 2013, Li and Dewey, 2011, Nicolae et al., 2011, Glauset al., 2012, Rossell et al., 2014, Hensman et al., 2015].Diﬀerential Transcript Expression (DTE) refers to the event where the overall relative ex-pression of a transcript changes between two conditions. In this case, θ k refers to the relativeexpression of transcript k ; k = 1 , . . . , K , with respect to the whole set of transcripts, with θ k (cid:62) (cid:80) Kk =1 θ k = 1. On the contrary, Diﬀerential Transcript Usage (DTU) refers to the event thatthe relative within gene abundance of a transcript changes between conditions. Consider a gene g = 1 , . . . , G with K g > θ ( g ) k = θ k (cid:80) j ∈ g θ j . Obviously, if a transcript belongs to a gene with K g = 1 then it isalways non-DTU. According to Gonz`alez-Porta et al. [2013] the dominant transcripts within agene are likely to be the main contributors to the proteome and switching events between themis a common scenario of gene modiﬁcation between conditions.Figure 1 illustrates the diﬀerences between DTE and DTU, considering a set of three genes(shown in red, blue and green) consisting of 2, 2 and 3 transcripts. In the case of DTE (upperpanel) the overall expression of transcripts 1 , , Differential transcript expression transcript e x p r e ss i on 0 . . . . . . condition A condition B gene 1gene 2gene 31 2 3 4 5 6 7 1 2 3 4 5 6 7 Differential transcript usage transcript e x p r e ss i on 0 . . . . . condition A condition B gene 1gene 2gene 3 Figure 1: Diﬀerential Transcript Expression (up) and Diﬀerential Transcript Usage (down).(green color) is not the same between conditions. In general, DTU implies DTE but the reverseis not necessarily true.In this paper we extend the use of two available methods in order to perform Bayesianinference for the problem of DTU. cjBitSeq [Papastamoulis and Rattray, 2017] was originallyintroduced as a Bayesian read-based model for DTE inference and here we modify it for theDTU problem. We also propose a Bayesian version of DRIMSeq [Nowicka and Robinson, 2016], acount-based approach originally introduced as a frequentist model for DTU inference. Genome-scale studies incorporate a large number of multiple tests, typically at the order of tens ofthousands. A crucial issue under a multiple comparisons framework is the control of the FalseDiscovery Rate (FDR), that is, the expected proportions of errors among the rejected hypotheses[Benjamini and Hochberg, 1995]. According to a recent benchmarking study [Soneson et al.,2015], the ability of frequentist count-based methods to control the FDR is drastically improvedby pre-ﬁltering low-expressed transcripts. This remains true for the Bayesian version of thecount-based method presented here (DRIMSeq). However it is not possible to incorporate sucha strategy for read-based methods (cjBitSeq) where transcript expression levels are not known3 priori. Therefore, under our Bayesian framework, we also propose the use of transformationsof the raw posterior probabilities and ﬁltering the output based on the notion of trust regionswhich are motivated from realistic scenarios of gene regulation [Gonz`alez-Porta et al., 2013].The rest of the paper is organized as follows. In Section 2 we brieﬂy describe existingmethods. The proposed Bayesian models are presented in Section 3. More speciﬁcally, Section3.1 reviews the cjBitSeq framework and also introduces the necessary prior modiﬁcations forthe problem of DTU. The likelihood of the DRIMSeq model is presented in Section 3.2 and aBayesian version is introduced next, along with a detailed description of the inference. Section 4deals with FDR control procedures. In Section 5 we report our ﬁndings on synthetic data usingthe carefully designed simulation study of Soneson et al. [2015]. In Section 5.1 we comparecjBitSeq and BayesDRIMSeq with respect to the decision rules of Section 4 using power versusachieved FDR plots. In Section 5.2 we benchmark these methods against existing ones andwe also report more performance measures, such as ROC and precision/recall curves as wellas comparisons in terms of run-time and memory requirements. A real RNA-seq dataset isanalysed in Section 6. The manuscript concludes with a Discussion. A prior sensitivity analysisof BayesDRIMSeq as well as a comparison between alternative inputs of BayesDRIMSeq andDRIMSeq based on diﬀerent quantiﬁcation methods is provided in the Appendix. cuﬀdiﬀ

The cuﬄinks/cuﬀdiﬀ [Trapnell et al., 2010, 2013] pipeline estimates the expression ofa set of transcripts and then performs various diﬀerential expression tests both on the transcriptand gene level. DTU at the gene level is based on comparing the similarity of two distributionsusing the square root of the Jensen-Shannon divergence [Osterreicher and Vajda, 2003, Endresand Schindelin, 2003]. Following Soneson et al. [2015], we used the gene-wise FDR estimatesfrom the cds.diff output ﬁle of cuﬀdiﬀ (version 2.2.1).

DEXSeq

DEXSeq [Anders et al., 2012] is the most popular method for inferring DTU. Thegenome is divided into disjoint parts of exons (counting bins) and a matrix of read counts intothe counting bins is used as input. The default method for counting reads for this purpose4s HTSeq [Anders et al., 2015]. Given the estimated reads from HTSeq, a negative binomialgeneralized linear model is ﬁt and DTU is inferred by testing whether the interaction termbetween conditions is diﬀerent from zero.

DRIMSeq

This recent package [Nowicka and Robinson, 2016] implements a dirichlet-multinomialmodel in order to describe the variability between replicates. A likelihood ratio test is performedin order to compare a full model with distinct parameters per condition and a null model whichassumes that the parameters are shared. The input is a matrix of counts per transcript. Weapplied this method using the following ﬁltering criteria: • min gene expr = 1 (Minimal gene expression in cpm) • min feature prop = 0 .

01 (Minimal proportion for feature expression) • min samps gene expr = 3 (Minimal number of samples where genes should be expressed) • min samps feature prop = 3 (Minimal number of samples where features should be ex-pressed) edgeR The function spliceVariants from the edgeR [Robinson et al., 2010] package can beused to identify genes showing evidence of splice variation using negative binomial generalizedlinear models. For each gene (containing at least two transcripts) a likelihood ratio test comparesa model with an interaction term between each condition against a null model with no interactionterm. The input corresponds to a matrix of counts per transcript. limma

The function diffSplice from the limma [Ritchie et al., 2015] package also tests forDTU by ﬁtting negative binomial generalized linear models and performing a likelihood ratiotest at the diﬀerence of log-fold changes. The input corresponds to a matrix of counts pertranscript. 5

New Bayesian approaches cjBitSeq was originally applied to problem of inferring transcripts with DTE and here this modelis modiﬁed for the problem of DTU. DRIMSeq is a frequentist-based approach for the problemof DTU and this model is now extended under a Bayesian framework. cjBitSeq is a read-basedmodel, that is, the observed data is a matrix of alignments of each read to the transcriptome. Onthe other hand, DRIMSeq is a count-based model, which uses as input a matrix of (estimated)counts corresponding to the number of reads originating from each transcript. Both methodsreport an estimate of the posterior probability of DTU per gene. cjBitSeq performs collapsedGibbs sampling on the space of latent states of each transcript, that is, a binary vector with0 corresponding to equally expressed (EE) transcripts and 1 otherwise. Bayesian DRIMSeqestimates the Bayes factor between a DTU and a null model. Therefore, cjBitSeq also reportsa posterior probability of DTU for each transcript which may be of interest for transcript-levelanalysis. In this study we focus our attention at the gene-level summaries as done in Sonesonet al. [2015].Both models take advantage of distributions with richer covariance structures compared tostandard sampling schemes: in particular, the Generalized Dirichlet distribution is arising as afull conditional distribution at the cjBitSeq model, while DRIMSeq is based on the Dirichlet-Multinomial distribution. The Generalized Dirichlet distribution allows for positive correlationsbetween proportions, something that it is not the case for a standard Dirichlet model, and theDirichlet-Multinomial distribution exhibits extra variation compared to a multinomial model.Interestingly, we note that both distributions were introduced by the same author [Mosimann,1962, Connor and Mosimann, 1969].

Let x = ( x , . . . , x r ), x i ∈ X , i = 1 , . . . , r , denote a sample of r short reads aligned to a given setof K transcripts. The sample space X consists of all sequences of letters A, C, G, T. Assuming6hat reads are independent, the joint probability density function of the data is written as x | θ ∼ r (cid:89) i =1 K (cid:88) k =1 θ k f k ( x i ) . (1)The number of components ( K ) is equal to the number of transcripts and it is considered asknown since the transcriptome is given. The parameter vector θ = ( θ , . . . , θ K ) ∈ P K − denotesrelative abundances, where P K − := { p k (cid:62) , k = 1 , . . . , K − K − (cid:88) k =1 p k (cid:54) p K := 1 − K − (cid:88) k =1 p k } . The component speciﬁc density f k ( · ) corresponds to the probability of a read aligning at someposition of transcript k , k = 1 , . . . , K . Since we assume a known transcriptome, { f k } Kk =1 areknown as well and they are computed according to the methodology described in Glaus et al.[2012], taking into account optional position and sequence-speciﬁc bias correction method.Papastamoulis and Rattray [2017] proposed a Bayesian model selection approach for iden-tifying diﬀerentially expressed transcripts from RNA-seq data. The methods builds upon theBitSeq model [Glaus et al., 2012, Papastamoulis et al., 2014, Hensman et al., 2015]. Comparedto other approaches, the main diﬀerence of cjBitSeq is that transcript expression and diﬀerentialexpression is jointly modelled. In contrast to other methods where the starting point of the DEanalysis is a count matrix, the input of cjBitSeq is the matrix L containing alignment proba-bilities of each read to the transcriptome. According to Equation (1), the probability of read i aligning at transcript k is given by L ik = f k ( x i ) for i = 1 , . . . , r and k = 1 , . . . , K .Assume that we have at hand two samples x := ( x , . . . , x r ) and y := ( y , . . . , y s ), with r and s denoting the number of (mapped) reads for sample x and y , respectively. Now, let θ k and w k denote the unknown relative abundance of transcript k = 1 , . . . , K in sample x and y ,respectively. Deﬁne the parameter vector of relative abundances as θ = ( θ , . . . , θ K − ; θ K ) ∈P K − and w = ( w , . . . , w K − ; w K ) ∈ P K − . Under the standard BitSeq model the prior onthe parameters θ and w would be a product of independent Dirichlet distributions. In thiscase the probability θ k = w k under the prior is zero and it is not straightforward to deﬁnenon-DE transcripts. To model diﬀerential expression we would instead like to identify instanceswhere transcript expression has not changed between samples. Therefore, we introduce a ﬁnite7robability for the event θ k = w k . This leads us to deﬁne a new model with a non-independentprior for the parameters θ and w . Deﬁnition 1 (State vector) . Let c := ( c , . . . , c K ) ∈ C , where C is the set deﬁned by:1. c k ∈ { , } , k = 1 , . . . , K c + := (cid:80) Kk =1 c k (cid:54) = 1 .Then, for k = 1 , . . . , K let:  θ k = w k , if c k = 0 θ k (cid:54) = w k , if c k = 1 . We will refer to vector c as the state vectorof the model. cjBitSeq was originally applied to the problem of DTE by introducing a cluster representationof aligned reads to transcripts. This clustering approach substantially reduces the dimension-ality of the samp ling space and makes the MCMC sampler converge to reasonable time. Itis important to mention that clusters are deﬁned under a data-driven algorithm, that is, bysearching the alignments of each read and identifying groups of transcripts sharing reads.Under the same approach, we would be able to infer clusters of transcripts with DTU.However, since in this work we focus on inference at the gene level, we impose the assumptionthat clusters are deﬁned as the transcripts of each gene. Otherwise, in some instances it willnot be straightforward to perform inference at the gene level, due to the possibility of clustersof transcripts merging multiple genes together. For example, we found that approximately 4 . p c t uv q w x z xy a g Figure 2: Directed Acyclic Graph representation for the cjBitSeq model. Squares and circlesrepresent unknown and observed/ﬁxed quantities, respectively.outperforms other choices. However, this choice introduces a prior bias to the case of DTU sincegenes with larger number of transcripts are assigned larger prior probability of DTU than geneswith small number of transcripts. Therefore, now it is a priori assumed that the probabilityof no diﬀerential expression within a gene is equally weighted with the event that at least twotranscripts exhibit DTU, that is, P ( c + = 0) = 0 .

5. An equal prior probability is assigned to therest possible conﬁgurations. Thus, the prior distribution on the state vector is deﬁned as:P( c ) = P( c | c + (cid:54) = 1) =  . , c + = 0 . K − K − , c + (cid:62) . (2)This modiﬁcation is necessary in order to ensure that no prior bias is enforced at the gene-levelwhich is the aim of the analysis in the DTU setup.A graphical model of the cjBitSeq prior assumptions is shown in Figure 2. The binary statevector c = ( c , . . . , c K ) deﬁnes diﬀerentially or equally expressed transcripts within each gene.The prior distribution of c is given by Equation (2), although in the general implementation ofPapastamoulis and Rattray [2017] an extra level of hierarchy is imposed by the hyper-parameter π , shown in Figure 2. The parameters u and v are a-priori independent Dirichlet randomvariables. The dimension of u is equal to K , i.e. the number of transcripts for a given gene. Onthe other hand, v is a random variable with varying dimension, which is deﬁned by the numberof diﬀerentially expressed transcripts, that is, (cid:80) Kk =1 c k . The parameters u and v along withan auxiliary parameter τ deﬁne via a suitable one-to-one transformation the actual transcript9xpression parameters θ and w . According to Theorem 1 of Papastamoulis and Rattray [2017], θ and w are marginally Dirichlet random variables, however they are not independent since theprobability of the events { θ k = w k ; k = 1 , . . . , K } is positive. At the next level of hierarchy, thelatent allocation variables ξ and z deﬁne the transcript allocation of each read from sample x and y , respectively, through the equations P ( ξ i = k ) = θ k , independent for i = 1 , . . . , r , and P ( z j = k ) = w k , independent for j = 1 , . . . , s .Papastamoulis and Rattray [2017] showed that the model is conjugate given c . But inorder to update ( c , v ), a reversible-jump mechanism [Green, 1995, Richardson and Green, 1997,Papastamoulis and Iliopoulos, 2009] is required. However, this step can be avoided by analyticalintegration of ( u , v ). Thus, a collapsed Gibbs sampler [Geman and Geman, 1984, Gelfand andSmith, 1990, Liu, 1994, Liu et al., 1995] updates the latent allocation variables ( ξ and z ) of eachread to its transcript of origin as well as the binary variables c k of each transcript state (DE orEE). Let x − [ i ] denote the vector arising from x after excluding its i -th entry. A pseudo-codedescription of the collapsed Gibbs MCMC sampler is:1. Update allocation variables for sample x : ξ i | ξ [ − i ] , z , c, x , y , i = 1 , . . . , r .2. Update allocation variables for sample y : z j | ξ , z [ − j ] , c, x , y , j = 1 , . . . , s .3. Draw a random sample (without replacement) of indices ( j , j ) from { , . . . , K } and up-date the block of state vector c j ,j | c − [ j ,j ] , ξ , z , x , y .4. Update ( θ , w , τ, u , v ) | c, ξ , z , x , y (optional).Note that the update 4 is optional in the sense that it is not required by any of the previous steps,however one can include it in order to also obtain MCMC samples of the transcript expressionparameters θ and w . For a detailed description of the conditional distributions involved in steps1–4 (as well as the alternative RJMCMC sampler) see Papastamoulis and Rattray [2017].According to our model, it is natural to call a gene as DE if at least two transcripts exhibitDTU. Hence, the posterior probability of DTU for a gene g is deﬁned as p g = P { c + > | x , y } , g = 1 , . . . , G, (3)and it is estimated by the corresponding ergodic average across the MCMC run (after burn-in).10 .2 BayesDRIMSeq Let n = n g denotes the total number of reads aligning to a gene g with k transcripts, g = 1 , . . . , G .Assume that X = X g = ( X , . . . , X k ) is the vector of reads originating from each transcript,according to an underlying vector θ = θ g = ( θ , . . . , θ k ) of relative abundances which is unknown.A priori, a Dirichlet prior is imposed on θ and, given θ , the observed reads are generatedaccording to a multinomial distribution, that is, θ ∼ D ( δ , . . . , δ k ) X | θ ∼ Multinomial( n, θ )Integrating out θ , this model leads to the Dirichlet-Multinomial [Mosimann, 1962] distribution:P( X = x ) = (cid:18) n x (cid:19) Γ( δ + )Γ( n + δ + ) k (cid:89) j =1 Γ( δ j + x j )Γ( δ j ) , where the ﬁrst term in the product denotes the multinomial coeﬃcient and δ + = (cid:80) Kk =1 δ k . Wewill write: X | n, δ ∼ DM ( n, δ ). It can be shown that E X = n π and Var X = (cid:26) n − δ + + 1 (cid:27) n { diag( π ) − ππ (cid:48) } , where π = { δ j /δ + ; j = 1 , . . . , k − } and diag( π ) denotes a diagonal matrix with diagonal entriesequal to π , . . . , π k − . Note that as δ + → ∞ the variance-covariance matrix of the Dirichlet-multinomial distribution reduces to n { diag( π ) − ππ (cid:48) } , that is, the variance-covariance matrixof the multinomial distribution. In any other case extra variation is introduced compared tostandard multinomial sampling, a well known property of the Dirichlet-multinomial distribution(see e.g. Neerchal and Morel [1998]).Consider now that a matrix of (estimated) read counts is available for two diﬀerent conditions,consisting of n and n replicates. Given two hyper-parameter vectors δ , δ , let X ( g ) i | n i , δ ∼ DM ( n i , δ ) , independent for i = 1 , . . . , n Y ( g ) j | n j , δ ∼ DM ( n j , δ ) , independent for j = 1 , . . . , n , X ( g ) i , Y ( g ) j denote two independent vectors of (estimated) number of reads for the tran-scripts of gene g = 1 , . . . , G for replicate i = 1 , . . . , n and j = 1 , . . . , n for the ﬁrst and secondcondition, respectively. Obviously, n i and n j denote the total number of reads generated fromgene g for the ﬁrst and second condition for replicates i and j .In this context, DTU inference is based on comparing the hyper-parameters of the Dirichlet-Multinomial distribution. Note that δ and δ is proportional to the average expression levelof the speciﬁc set of transcripts. Typically, there are large diﬀerences in the scale of theseparameters, thus their direct comparison does not reveal any evidence for DTU. For this reason,it is essential to reparametrize the model as follows: δ = d g (4) δ = d g , (5)where d > d > g = ( g , . . . , g k ), g = ( g , . . . , g k ), with (cid:80) ki =1 g i = (cid:80) ki =1 g i = 1and g i , g i > i = 1 , . . . , k .In this case, DTU inference is based on comparing the null model: M : g = g versus the full model where M : g (cid:54) = g . A likelihood ratio test is implemented in the DRIMSeq package for testing the hypothesis of thenull versus the full model. In this work, we propose to compare the two models by applyingapproximate Bayesian model selection techniques. In particular, a priori it is assumed that d i ∼ E ( λ ) , independent for i = 1 , g i ∼ D (1 , . . . ,

1) independent for i = 1 , , (7)and furthermore d i and g j are mutually independent.In order to perform Bayesian model selection, the Bayes factor [Kass and Raftery, 1995]of the null against the full model is approximated using a two stage procedure. At ﬁrst, theposterior distribution of each model is approximated using Laplace’s approximation [Laplace,12774, 1986], a well established practice for approximating posterior moments and posterior dis-tributions [Tierney and Kadane, 1986, Tierney et al., 1989, Azevedo-Filho and Shachter, 1994,Raftery, 1996]. Then, the logarithm of marginal likelihoods of M and M are estimatedusing independent samples from the posterior distribution via self-normalized sampling impor-tance resampling [Gordon et al., 1993]. Finally, the posterior probabilities p ( M | x ( g ) , y ( g ) ), and p ( M | x ( g ) , y ( g ) ) are estimated assuming equally weighted prior probabilities.Denote by g the common value of g , g in model M . Let u = ( g , d , d ) ∈ U , u =( g , g , d , d ) ∈ U denote the parameters associated with models M and M , respectively.Obviously, the underlying parameter spaces are deﬁned as U = P K g − × (0 , + ∞ ) and U = P K g − × (0 , + ∞ ) . The marginal likelihood of data under model M j , is deﬁned as f ( x ( g ) , y ( g ) |M j ) = (cid:90) U j f ( x ( g ) , y ( g ) | u j ) f ( u j | λ )d u j , j = 0 , . According to the basic importance sampling identity, the marginal likelihood model can beevaluated using another density φ , which is absolutely continuous on U j , as follows f ( x , y |M j ) = (cid:90) U j f ( x ( g ) , y ( g ) | u j ) f ( u j | λ ) φ ( u j ) φ ( u j )d u j . The minimum requirement for φ is to satisfy φ ( u j ) > f ( x ( g ) , y ( g ) | u j ) f ( u j | λ ) > { u ( i ) ; i = 1 , . . . , n } is drawn from φ ( · ). Then, the importance samplingestimate of the marginal likelihood is (cid:98) f ( x ( g ) , y ( g ) |M j ) = 1 n n (cid:88) i =1 f ( x ( g ) , y ( g ) | u ( i ) j ) f ( u ( i ) j | λ ) φ ( u ( i ) j ) , j = 0 , . The candidate distribution φ is the approximation of the posterior distribution according to theLaplace’s method. It is well known that basic importance sampling performs reasonably well incases that the number of parameters is not too large. However, it can be drastically improvedusing sequential Monte Carlo methods, such as sampling importance resampling [Gordon et al.,1993, Liu and Chen, 1998]. The R package LaplacesDemon [Statisticat and LLC., 2016] is usedfor this purpose.Finally, the posterior probability of the DTU model is deﬁned as p g = P ( M | x ( g ) , y ( g ) ) ∝ f ( x ( g ) , y ( g ) |M ) P ( M ) , g = 1 , . . . , G, (8)13y also assuming equally weighted prior probabilities, that is, P ( M ) = P ( M ) = 0 .

5. Notethat the Bayes Factor of the null against the full model is then given by B ( g )01 = P ( M | x ( g ) , y ( g ) ) P ( M | x ( g ) , y ( g ) ) = f ( x ( g ) , y ( g ) |M ) f ( x ( g ) , y ( g ) |M ) , g = 1 , . . . , G since the prior odds ratio is equal to one.In case that low expressed transcripts are included in the computation, the Laplace approx-imation faces many convergence problems. We have found that this problem can be alleviatedby pre-ﬁltering low expressed transcripts, as also pointed out by Soneson et al. [2015]. In this section we consider various decision rules in order to control the False Discovery Rate(FDR) [Benjamini and Hochberg, 1995, Storey, 2003, M¨uller et al., 2004, 2006]. Decision rules(9) and (11) are taking into account the whole set of genes and make use of the raw andtransformed posterior probabilities, respectively. Intuitively, the transformation of posteriorprobabilities prioritizes genes consisting of transcripts with large changes in their expression.Decision rules (10) and (12) are based on ﬁltering the output of (9) and (11) according to atrust region.A decision rule based on the raw gene-level posterior probabilities of DTU, as deﬁned inEquations (3) and (8), is the following. d g =  , (cid:98) p g (cid:62) − α , otherwise. (9)Note that for the problem of inferring DTE the decision rule (9) is the one used by Leng et al.[2013]. However, the cjBitSeq model takes into account changes to any subset of transcriptswithin a gene, thus, (9) may identify a large number of genes consisting of relatively smallchanges in low expressed transcripts. A more conservative choice will focus our attention to thedominant transcripts, where more reads are available and potentially the results will be morerobust. 14ext we deﬁne a ﬁltering of the output based on a “trust region”. Let i and j denote theestimated dominant transcripts in condition A and B, respectively. The trust region correspondsto the subset of genes where the relative ordering of estimated expression levels of dominanttranscript switches, that is, G = { g = 1 , . . . , G : ( (cid:98) θ ( g ) i − (cid:98) θ ( g ) j )( (cid:98) w ( g ) i − (cid:98) w ( g ) j ) < } . Switching events between dominant transcripts have been proposed as a major source of DTUin real RNA-seq data [Gonz`alez-Porta et al., 2013].Note that in the previous expression we used the notation of transcript expression levels ac-cording to cjBitSeq. For BayesDRIMSeq θ and w should be replaced by g and g , respectively.The decision rule which corresponds to ﬁltering (9) according to G is the following: d g =  , (cid:98) p g (cid:62) − α and g ∈ G , otherwise. (10)Note that decision rules d and d are solely based on the posterior probabilities of gene DTUand the trust region, respectively. However, it makes sense to also take into account addi-tional information, such as the magnitude of the change of the within gene relative transcriptexpression, which is a by-product of our algorithm.In order to clarify this, consider the following example. Assume that genes g and g bothconsist of two transcripts. For g , let θ ( g )1 = 0 . θ ( g )2 = 0 . w ( g )1 = 0 . w ( g )2 = 0 . g , let θ ( g )1 = 0 . θ ( g )2 = 0 . w ( g )1 = 0 . w ( g )2 = 0 .

4. Furthermore, assume that theposterior evidence of DE is the same for both genes, that is, (cid:98) p g = (cid:98) p g = p . In the case that theposterior probability p is suﬃciently large, genes g and g will be given the same importancein our discovery list. Note however that for gene g the absolute change in relative expressionis 4 times larger than for gene g . Ideally, we would like our discovery list to rank higher gene g than gene g . This is achieved using the following FDR control procedure.Consider any (Bayesian) method that for each gene yields an estimate of the posterior prob-ability of DTU per gene p g , g = 1 , . . . , G . • For a given permutation τ = ( τ , τ , . . . , τ G ) of { , , . . . , G } and let q g = p τ g , g = 1 , . . . , G .15 Deﬁne: r g = (cid:80) gj =1 − q j g , g = 1 , . . . , G . • For 0 < α <

1, consider the decision rule: d g =  , (cid:54) g (cid:54) g ∗ , g ∗ + 1 (cid:54) g (cid:54) G (11)where g ∗ := max { g = 1 , . . . , G : r g (cid:54) α } . • (cid:98) E (FDR | data) = (cid:80) g ∗ j =1 − q j g ∗ (cid:54) α Here we mention that in the original implementation of cjBitSeq for the DTE problem, thepermutation τ was deﬁned as the one that orders the posterior probabilities of transcript DE indecreasing order.The permutation that takes into account the previously described concept of magnitudechange is deﬁned as follows. Let ρ g = max | (cid:98) θ ( g ) k − (cid:98) w ( g ) k | , k = 1 , . . . , K g , where (cid:98) θ ( g ) k and (cid:98) w ( g ) k denote the posterior mean estimates of within gene transcript expression for a given transcript k of gene g . Consider the permutation τ = ( τ , τ . . . , τ G ) that orders the set { ρ g ; g = 1 , . . . , G } in decreasing order, that is: ρ τ (cid:62) ρ τ (cid:62) . . . (cid:62) ρ τ G . Finally, we combine decision rule d with the trust region G to obtain our ﬁnal decisionrule, that is, d g =  , (cid:54) g (cid:54) g ∗ and g ∈ G , otherwise. (12) In order to assess the performance of the proposed methods and decision rules as well as to com-pare against existing models, a set of simulation studies is used. Instead of setting up our ownsimulation scenarios, we followed the pipeline introduced in the recent study of Soneson et al.[2015], where a large number of count-based method is being benchmarked. Synthetic RNA-seq16 f r equen cy f r equen cy Figure 3: Frequencies (in log scale) of number of annotated transcripts per gene for drosophila(up) and human (down). The total number of genes and transcripts is 13937 and 26951 fordrosophila and 20410 and 145342 for human, respectively.reads are generated from the Drosophila Melanogaster and Homo Sapiens transcriptomes usingthe RSEM-simulator [Li and Dewey, 2011]. The model parameters for RSEM-simulator were es-timated from real datasets using a Negative Binomial model described in Soneson and Delorenzi[2013]. The transcriptomes of these two organisms exhibit strong diﬀerences as illustrated inFigure 3. The average number of transcripts per gene is considerably smaller for fruit ﬂy, how-ever the transcripts are longer than for human (see also Supplementary Table 1 of Soneson et al.[2015]).Following Soneson et al. [2015], for each organism we simulated 3 replicates per condition.Each replicate consists of 25 million paired-end reads with length 101 base-pairs. Diﬀerentialtranscript usage was introduced for 1000 genes, by reversing the relative abundance of the twomost abundant transcripts in one of the two conditions. The total number of reads for each17ranscript may or may not be equal across conditions. If the total number of reads generatedfrom a gene is constant, no gene-level diﬀerential expression is evident. For the drosophila readsno gene-level diﬀerential expression was introduced. For human reads both cases are considered.Finally, the simulated reads are mapped to the genome or transcriptome with Tophat2 [Trapnellet al., 2009] and Bowtie2 [Langmead et al., 2009], respectively. Cuﬄinks and HTSeq used thealignment ﬁles produced by Tophat2, while BitSeqVB and cjBitSeq use the alignment producedfrom Bowtie2, allowing a maximum of 100 hits per read. The count matrix used as input toDEXSeq is estimated using the default HTSeq method, while BitSeqVB is used for input toedgeR, limma, DRIMSeq and BayesDRIMSeq.

Figure 4 displays the power versus achieved FDR using the decision rules d k ; k = 1 , , , α = 0 . , . , . , .

1, which are shown as dashed vertical lines. The plotted pointscorrespond to the achieved FDR ( x axis) and the proportion of true discoveries ( y axis). Theability of each decision rule to control the FDR depends on the distance of each point fromthe corresponding vertical line: the closer, the better. On the other hand, a decision rule withhigher y values is more powerful.For cjBitSeq (upper panel) we conclude that the trust-region adjusted rules d and d achievelower FDRs which are quite close to the expected values. However, note that d yields betterpower compared to d , especially for the human datasets. BayesDRIMSeq is shown in middleand lower panel of 4. At the second panel of Figure 4 we have applied BayesDRIMSeq byﬁltering out transcripts with average number of reads less than 20. The results correspondingto the full set of transcripts (no pre-ﬁltering) are shown at the lower panel of 4. We concludethat isoform pre-ﬁltering is essential in order to achieve reasonable control of FDR in the caseof human data. Note also that under isoform pre-ﬁltering the trust region does not have a highimpact on BayesDRIMSeq. 18 .00 0.05 0.10 0.15 0.20 0.25 . . . achieved FDR T r ue P o s i t i v e llll d1d2d3d4 0.00 0.05 0.10 0.15 0.20 0.25 . . . . . . achieved FDR T r ue P o s i t i v e llll d1d2d3d4 0.00 0.05 0.10 0.15 0.20 0.25 . . . . . . achieved FDR T r ue P o s i t i v e llll d1d2d3d40.00 0.02 0.04 0.06 . . . . achieved FDR T r ue P o s i t i v e llll d1d2d3d4 0.00 0.04 0.08 0.12 . . . . achieved FDR T r ue P o s i t i v e llll d1d2d3d4 0.00 0.02 0.04 0.06 0.08 0.10 . . . . achieved FDR T r ue P o s i t i v e llll d1d2d3d40.00 0.02 0.04 0.06 0.08 . . . . . achieved FDR T r ue P o s i t i v e llll d1d2d3d4 0.00 0.05 0.10 0.15 0.20 . . . achieved FDR T r ue P o s i t i v e ll ll d1d2d3d4 0.0 0.1 0.2 0.3 0.4 0.5 . . . . achieved FDR T r ue P o s i t i v e ll ll d1d2d3d4 (a) (b) (c)Figure 4: Power versus achieved FDR plot using the decision rules d , d , d and d for cjBitSeq(1st row) and BayesDRIMSeq with (second row) and without (third row) isoform pre-ﬁltering onthe simulated data. The vertical dashed lines show the expected FDR level (0.01, 0.025,0.05,0.1).(a): drosophila, (b): human without DTE and (c): human with DTE.19 .00 0.05 0.10 0.15 0.20 . . . . . . achieved FDR T r ue P o s i t i v e ll ll ll l cjBitSeqBayesDRIMSeqDRIMSeqcuffdifedgeRlimmaDEXSeq False positive rate T r ue po s i t i v e r a t e . . . . . . . . cjBitSeqBayesDRIMSeqDRIMSeqcuffdiffedgeRlimma Recall P r e c i s i on . . . . . cjBitSeqBayesDRIMSeqDRIMSeqcuffdiffedgeRlimma0.0 0.1 0.2 0.3 0.4 0.5 . . . . . . achieved FDR T r ue P o s i t i v e ll ll lll cjBitSeqBayesDRIMSeqDRIMSeqcuffdifedgeRlimmaDEXSeq False positive rate T r ue po s i t i v e r a t e . . . . . . . . cjBitSeqBayesDRIMSeqDRIMSeqcuffdiffedgeRlimma Recall P r e c i s i on . . . . . cjBitSeqBayesDRIMSeqDRIMSeqcuffdiffedgeRlimma0.0 0.1 0.2 0.3 0.4 0.5 . . . . . . achieved FDR T r ue P o s i t i v e ll ll lll cjBitSeqBayesDRIMSeqDRIMSeqcuffdifedgeRlimmaDEXSeq False positive rate T r ue po s i t i v e r a t e . . . . . . . . cjBitSeqBayesDRIMSeqDRIMSeqcuffdiffedgeRlimma Recall P r e c i s i on . . . . . cjBitSeqBayesDRIMSeqDRIMSeqcuffdiffedgeRlimma (a) (b) (c)Figure 5: Performance measures for drosophila (1st row), human without DTE (2nd row) andhuman with DTE (3rd row). (a): power versus achieved FDR plot. The vertical dashed linesshow the expected FDR level (0.01, 0.025,0.05,0.1). (b): ROC curve. (c): Precision/recall curve.20 .2 Comparison against existing methods For cuﬀdiﬀ, DRIMSeq, edgeR and limma we use the gene-level p-values at the ROC and pre-cision/recall plots and the adjusted q-values at the power versus achieved FDR plot. However,dexSeq reports only the adjusted q-values, hence this method is not shown at ROC and pre-cision/recall curves. Note that for all these methods, the adjusted q-values correspond to theBenjamini and Hochberg [1995] FDR control procedure. For cjBitSeq we used the raw FDR rate(11) at the ROC and precision/recall curves and the adjusted FDR (12) at the power versusachieved FDR plots. For BayesDRIMSeq we used the raw FDR rate (11) at all plots, afterpre-ﬁltering isoforms with an average number of reads less than 20.The performance measures of the evaluated methods are shown in Figure 5. Comparingresults for the two organisms, it is clear that edgeR, limma, dexSeq and frequentist DRIMSeqexhibit large diﬀerences in their ability to control the FDR. In particular, these methods exhibitsigniﬁcantly larger False Discovery rates for the human datasets compared to drosophila. Onthe other hand, cjBitSeq and BayesDRIMSeq are able to produce consistent results in all cases,being able at the same time to control the FDR within the (0 , .

1) area.More speciﬁcally, for the drosophila example observe that cjBitSeq exhibits smaller achievedFDR rate and larger True Positive Rate compared to DRIMSeq, dexSeq and edgeR. BayesDRIM-Seq achieves even smaller FDR rates but the number of True Positives is reduced compared tocjBitSeq. For the human examples we conclude that cjBitSeq exhibit almost similar performancein terms of FDR control, however the former is able to discover a larger number of DTU genesin both cases. DRIMSeq and dexSeq achieve FDR rates between (0 . , .

30) but DRIMSeq alsoachieves larger True Positive Rates compared to dexSeq. Cuﬀdiﬀ exhibits an almost perfectcontrol of the FDR, at the cost of substantially reduced power. The ROC and precision/recallcurves, shown at Figure 5.(b) and (c) respectively, suggest that cjBitSeq and DRIMSeq areconsistently ranked higher than other methods. Overall, we conclude that cjBitSeq outperformsall other methods.The run-time per method is illustrated in Figure 6, with respect to the maximum amount ofvirtual memory used by each process. For the counting-based methods, the main computationalburden of the two-stage pipeline is due to the ﬁrst stage (that is, either HTSeq or BitSeqVB).21 ll l lll runtime (hours) m a x V m e m ( G B ) l Method:HTSeq/DEXSeq (4)HTSeq/cuffdiff (4)BitSeqVB/DRIMSeq (4)BitSeqVB/BayesDRIMSeq (4)cjBitSeq (4)cjBitSeq (8)cjBitSeq (12) Dataset:human1human2drosophila

Figure 6: Wall clock runtime versus maximum value (in log-scale) of virtual memory used. Thenumber of cores used by each process is shown in parenthesis. For each dataset the total numberof reads is equal to 150 millions.DRIMSeq, edgeR, limma which used BitSeqVB as input exhibit nearly identical computingperformance so only DRIMSeq is shown. Compared to the counting-based methods, cjBitSeqrequires longer computing times, which should be expected given that cjBitSeq performs MCMCsampling on the space of all possible conﬁgurations of each transcript using as input the readalignments. However, note that cjBitSeq is quite eﬃcient with respect to the memory used andthat both memory and computing time vigorously scale with the number of available cores.Therefore, it is suggested to run cjBitSeq using at least 8 cores, since the memory requirementsstay within reasonable levels. Finally, we mention that isoform pre-ﬁltering is also essential forthe computing time of BayesDRIMSeq. In case where no ﬁltering takes place, the wallclock timeis increased almost 2 . . In this section we benchmark the new Bayesian methods against DRIMSeq using real RNA-seqdata from human lung normal and adenocarcinoma samples from six Korean female nonsmoking22 .01 0.025 0.05050010001500 B a y e s DR I M S eq d1 B a y e s DR I M S eq d2 B a y e s DR I M S eq d3 B a y e s DR I M S eq d4 c j B i t S eq d1 c j B i t S eq d2 c j B i t S eq d3 c j B i t S eq d4 DR I M S eq B a y e s DR I M S eq d1 B a y e s DR I M S eq d2 B a y e s DR I M S eq d3 B a y e s DR I M S eq d4 c j B i t S eq d1 c j B i t S eq d2 c j B i t S eq d3 c j B i t S eq d4 DR I M S eq B a y e s DR I M S eq d1 B a y e s DR I M S eq d2 B a y e s DR I M S eq d3 B a y e s DR I M S eq d4 c j B i t S eq d1 c j B i t S eq d2 c j B i t S eq d3 c j B i t S eq d4 DR I M S eq method n mutual noyes (a) 6 normal versus 6 cancer samples B a y e s DR I M S eq d1 B a y e s DR I M S eq d2 B a y e s DR I M S eq d3 B a y e s DR I M S eq d4 c j B i t S eq d1 c j B i t S eq d2 c j B i t S eq d3 c j B i t S eq d4 DR I M S eq B a y e s DR I M S eq d1 B a y e s DR I M S eq d2 B a y e s DR I M S eq d3 B a y e s DR I M S eq d4 c j B i t S eq d1 c j B i t S eq d2 c j B i t S eq d3 c j B i t S eq d4 DR I M S eq B a y e s DR I M S eq d1 B a y e s DR I M S eq d2 B a y e s DR I M S eq d3 B a y e s DR I M S eq d4 c j B i t S eq d1 c j B i t S eq d2 c j B i t S eq d3 c j B i t S eq d4 DR I M S eq method n mutual noyes (b) 3 normal versus 3 normal samplesFigure 7: Inferred number of genes with DTU ( n ) at level α ∈ { . , . , . } for the com-parison of 6 control and 6 tumor samples and null comparisons of 3 versus 3 control samples.For the null comparisons no diﬀerential splicing is expected. For the Bayesian methods cjBitSeqand BayesDRIMSeq all 4 decision rules are used. Green color corresponds to the number ofDTU genes detected by each method that overlap with DRIMSeq and red corresponds to theopposite case. 23atients [Kim et al., 2013]. The data corresponds to samples from GSM927308 to GSM927319and was downloaded from NCBIs Gene Expression Omnibus (GEO) under the accession num-ber GSE37764: SRR493937, SRR493939, SRR493941, SRR493943, SRR493945, SRR493947,SRR493949, SRR493951, SRR493953, SRR493955, SRR493957, SRR493959.The data consist of paired-end reads with length equal to 78 base pairs which were mapped tothe reference transcriptome using Bowtie2. The overall alignment rates and the total number ofmapped reads range between (70% , × , × ), respectively. Next, BitSeq wasused in order to calculate the matrix of alignment probabilities (as input to cjBitSeq) as well as toobtain a matrix of estimated counts per transcript (as input to DRIMSeq and BayesDRIMSeq).Following Nowicka and Robinson [2016], we benchmark our methods using two comparisons:(a) a two-group comparison of 6 normal versus 6 cancer samples and (b) “mock” comparisonswhere 3 versus 3 samples from the normal condition are compared. For the latter scenario theexpectation is to detect no DTU since replicates of the same condition are compared, althoughthe biological variation between the replicates of the normal condition is high [as noted byNowicka and Robinson, 2016]. The results are displayed in Figure 7, using diﬀerent cutoﬀvalues for controlling the FDR. For the 6 normal versus 6 cancer samples comparison (Figure7.a), we conclude that all decision rules contain a large amount of genes which overlap withDRIMSeq (green colored regions), especially for the trust-region adjusted rules d and d . Forthe “mock” comparison (Figure 7.b), at ﬁrst note that a smaller number of DTU genes is inferred.Second, observe that the decision rule d is capable of substantially reducing the number of falsediscoveries compared to DRIMSeq and that this number is almost zero when using α = 0 . In this study we exempliﬁed the use of Bayesian methods for inferring genes with diﬀerentialtranscript usage. For this purpose two previously introduced models were modiﬁed and ex-tended: cjBitSeq and a Bayesian version of DRIMSeq. After deﬁning proper decision rules weconcluded that both methods exhibit superior or comparable performance with other methods.This was achieved by using the decision rule deﬁned in Equation (11), shown in the ROC and24recision-recall curves. According to (11), the whole sequence of posterior probabilities is trans-formed with respect to the ordering of the magnitude change of relative expression betweenconditions. For the read-based method (cjBitSeq) FDR control is improved when the decisionrule is combined with a trust region. For the count-based method (BayesDRIMSeq) FDR con-trol is mainly aﬀected by the ﬁltering of low-expressed transcripts, as previously reported undera frequentist context by Soneson et al. [2015]. BayesDRIMSeq exhibits slightly better FDRcontrol than cjBitSeq for the drosophila dataset, however this eﬀect is not so evident for thehuman datasets. In all cases cjBitSeq is more powerful than BayesDRIMSeq, but at the cost ofincreased computing time.Regarding the analysis of real RNA-seq data, we compared our ﬁndings to DRIMSeq. Wereported results based on a comparison of two diﬀerent conditions, as well as “mock” comparisonsof replicates within the same condition where no evidence of diﬀerential expression is expected.We concluded that our DTU lists contain a large number of genes also detected by DRIMSeq.Moreover, using conservative decision rules like d we are able to substantially reduce the numberof false discoveries when performing comparisons within the same condition.The methods are freely available from https://github.com/mqbssppe/cjBitSeq (cjBitSeq)and https://github.com/mqbssppe/BayesDRIMSeq (BayesDRIMSeq). The source code forgenerating the simulated datasets of Soneson et al. [2015] is available from https://github.com/markrobinsonuzh/diff splice paper . Acknowledgements

The research was supported by MRC award MR/M02010X/1, BBSRC award BB/J009415/1and EU FP7 project RADIANT (grant 305626). The authors would like to acknowledge theassistance given by IT Services and the use of the Computational Shared Facility at The Univer-sity of Manchester. Regarding BayesDRIMSeq and replication of simulations, helpful discussionswith Mark Robinson, Malgorzata Nowicka and Charlotte Soneson (Institute of Molecular LifeSciences, University of Zurich) are gratefully acknowledged.25 .00 0.04 0.08 0.12 . . . . . . . achieved FDR T r ue P o s i t i v e lllllllllll lambda 0.01lambda 0.1lambda 0.2lambda 0.3lambda 0.4lambda 0.5lambda 0.6lambda 0.7lambda 0.8lambda 0.9lambda 1 0.00 0.05 0.10 0.15 0.20 . . . . . . . achieved FDR T r ue P o s i t i v e lllllllllll lambda 0.01lambda 0.1lambda 0.2lambda 0.3lambda 0.4lambda 0.5lambda 0.6lambda 0.7lambda 0.8lambda 0.9lambda 1 0.00 0.05 0.10 0.15 0.20 . . . . . . . achieved FDR T r ue P o s i t i v e lllllllllll lambda 0.01lambda 0.1lambda 0.2lambda 0.3lambda 0.4lambda 0.5lambda 0.6lambda 0.7lambda 0.8lambda 0.9lambda 1 (a) (b) (c)Figure 8: Prior sensitivity of BayesDRIMSeq with respect to λ . A Prior Sensitivity of BayesDRIMSeq

According to Equation (6), the prior assumptions of BayesDRIMSeq are depending on the ﬁxedhyperparameter λ . Figure 8 displays the power versus achieved FDR curves based on the decision d as a function of λ ∈ { . , . , . , . . . , } (after isoform pre-ﬁltering). We conclude that thevalue λ = 0 . . . λ = 0 . B Using Kallisto counts

In the main text we used BitSeqVB count estimates as input to DRIMSeq and BayesDRIMSeq.According to the recent study of Hensman et al. [2015], BitSeqVB is ranked as one of themost accurate methods for estimating transcript expression levels. Since there is a variety ofalternative methods for this purpose, we compare the performance when Kallisto [Bray et al.,2016] counts are being used as input. As shown in Figure 9, we conclude that in drosophila databoth BayesDRIMSeq and DRIMSeq perform better when BitSeqVB counts are used. Howeverthere is no clear ordering in the human datasets: in both cases BitSeqVB counts correspond toincreased power but at the cost of slightly worse FDR calibration. Finally, ROC and precision-26ecall curves suggest that BitSeqVB leads to slightly increased performance for both methods.

References