[PDF] Differential Expression Analysis of Dynamical Sequencing Count Data with a Gamma Markov Chain

Abstract

Next-generation sequencing (NGS) to profile temporal changes in living systems is gaining more attention for deriving better insights into the underlying biological mechanisms compared to traditional static sequencing experiments. Nonetheless, the majority of existing statistical tools for analyzing NGS data lack the capability of exploiting the richer information embedded in temporal data. Several recent tools have been developed to analyze such data but they typically impose strict model assumptions, such as smoothness on gene expression dynamic changes. To capture a broader range of gene expression dynamic patterns, we develop the gamma Markov negative binomial (GMNB) model that integrates a gamma Markov chain into a negative binomial distribution model, allowing flexible temporal variation in NGS count data. Using Bayes factors, GMNB enables more powerful temporal gene differential expression analysis across different phenotypes or treatment conditions. In addition, it naturally handles the heterogeneity of sequencing depth in different samples, removing the need for ad-hoc normalization. Efficient Gibbs sampling inference of the GMNB model parameters is achieved by exploiting novel data augmentation techniques. Extensive experiments on both simulated and real-world RNA-seq data show that GMNB outperforms existing methods in both receiver operating characteristic (ROC) and precision-recall (PR) curves of differential expression analysis results.

Full PDF

DDiﬀerential Expression Analysis of DynamicalSequencing Count Data with a Gamma Markov Chain

Ehsan Hajiramezanali , Siamak Zamani Dadaneh , Paul de Figueiredo ,Sing-Hoi Sze , Mingyuan Zhou , and Xiaoning Qian March 8, 2018

Abstract

Next-generation sequencing (NGS) to proﬁle temporal changes in living systemsis gaining more attention for deriving better insights into the underlying biologicalmechanisms compared to traditional static sequencing experiments. Nonetheless,the majority of existing statistical tools for analyzing NGS data lack the capa-bility of exploiting the richer information embedded in temporal data. Severalrecent tools have been developed to analyze such data but they typically imposestrict model assumptions, such as smoothness on gene expression dynamic changes.To capture a broader range of gene expression dynamic patterns, we develop thegamma Markov negative binomial (GMNB) model that integrates a gamma Markovchain into a negative binomial distribution model, allowing ﬂexible temporal vari-ation in NGS count data. Using Bayes factors, GMNB enables more powerful Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX77843, USA, Center for Bioinformatics & Genomic Systems Engineering, Texas A&M University, College Station,TX 77843, USA, Department of Microbial Pathogenesis and Immunology, Texas A&M Health Science Center, CollegeStation, TX 77843, USA, Department of Veterinary Pathobiology, and Norman Borlaug Center, Texas A&M University, 77843,USA, Department of Computer Science and Engineering, and Department of Biochemistry & Biophysics,Texas A&M University, College Station, TX 77843, USA, Department of Information, Risk, and Operations Management (IROM), The University of Texas atAustin, Austin, TX 78712, USA. * Correspondence should be addressed to Xiaoning Qian ( [email protected] ). a r X i v : . [ s t a t . A P ] M a r emporal gene diﬀerential expression analysis across diﬀerent phenotypes or treat-ment conditions. In addition, it naturally handles the heterogeneity of sequencingdepth in diﬀerent samples, removing the need for ad-hoc normalization. EﬃcientGibbs sampling inference of the GMNB model parameters is achieved by exploit-ing novel data augmentation techniques. Extensive experiments on both simulatedand real-world RNA-seq data show that GMNB outperforms existing methods inboth receiver operating characteristic (ROC) and precision-recall (PR) curves ofdiﬀerential expression analysis results. Keywords: Gamma Markov chain, negative binomial, Gibbs sampling, temporalRNA-seq data, Bayes factor

Advances in next-generation sequencing (NGS) techniques have enabled researchers toproduce millions of relatively short reads for genome-scale bioinformatics research [Con-sortium et al., 2012, Wang et al., 2009, Mortazavi et al., 2008]. Transcriptome analyses,including gene expression proﬁling and transcript quantiﬁcation through RNA-sequencing(RNA-seq), can help better understand biological processes of interest. RNA-seq countdata are highly over-dispersed with large dynamic ranges [Anders et al., 2015]. A largenumber of statistical tools have been developed for diﬀerential gene expression analysis ofRNA-seq data [Anders and Huber, 2010, Dadaneh et al., 2017, Robinson et al., 2010, Loveet al., 2014, Law et al., 2014, Hardcastle and Kelly, 2010, Leng et al., 2013], which mostlyhave adopted the negative binomial (NB) distribution to account for over-dispersion aswell as high uncertainty inherent in RNA-seq data due to the small number of replicatesamples in typical diﬀerential expression experiments [Love et al., 2014].Living systems are complex and dynamic. There has been signiﬁcant interest in an-alyzing temporal RNA-seq count data [Bar-Joseph et al., 2012]. For example, in cellbiology or drug discovery research, monitoring molecular expression changes in responseto speciﬁc stimuli can help better understand cellular mechanisms at the transcriptional2nd post-transcriptional regulatory levels under diﬀerent conditions. One important taskis to identify the genes that are diﬀerentially expressed over time across diﬀerent con-ditions, which is more challenging compared to static RNA-seq data analysis due topotential temporal dependencies [Lienau et al., 2009].Recently, several dynamic diﬀerential RNA-seq analysis methods have been devel-oped to better capture temporal dependency. For example, EBSeq-HMM [Leng et al.,2015] takes an empirical Bayesian mixture modeling approach to compare the expressionchange across consecutive time points to identify genes that display signiﬁcant transcrip-tion changes over time under one treatment condition. Across diﬀerent conditions, itis desirable to identify genes that have diﬀerent dynamic patterns. For this purpose,next-maSigPro [Nueda et al., 2014] has extended a generalized linear model (GLM) [Mc-Cullagh, 1984] based dynamic diﬀerential expression analysis for microarray data frommultiple time points to analyze temporal RNA-seq data. However, modeling RNA-seqcounts by real values may lead to information loss and GLM may not be able to capturecomplicated dynamic changes in expression. An autoregressive time-lagged AR (1) modelwith Markov Chain Monte Carlo (MCMC) inference [Oh et al., 2013] has also been pro-posed to identify genes with diﬀerent temporal expression changes. But the posteriorestimates of model parameters through Metropolis-Hastings inference lead to high com-putational complexity. DyNB [ ¨Aij¨o et al., 2014] has been proposed recently to modelthe temporal RNA-seq counts by NB distributions with their temporal expected valuesmodeled by non-parametric Gaussian Processes (GP). DyNB can detect the genes withdiﬀerential dynamic patterns that static diﬀerential expression analysis, which considerindividual time points, fail to discover. In addition to high computational complexity dueto MCMC inference [Spies and Ciaudo, 2015, Sun et al., 2016], DyNB may fail to modelpotential abrupt expression changes due to its inherent smoothness assumptions [Ras-mussen and Williams, 2006].We present a new dynamic diﬀerential expression analysis method for temporal RNA-3eq data, GMNB (gamma Markov negative binomial), which is a hierarchical model tointroduce a gamma Markov chain [Acharya et al., 2015, Schein et al., 2016] to model thepotential dynamic transitions of the model parameters in NB distributions. With thisnew model for temporal RNA-seq data and an eﬃcient inference algorithm, GMNB isexpected to provide the following advantages over existing methods: 1) GMNB can modelmore general dynamic expression patterns than DyNB, especially for abrupt expressionchanges across consecutive time points; 2) The closed-form Gibbs sampling can be de-rived to infer the model parameters in GMNB, which is computationally more eﬃcientthan the existing methods; 3) For dynamic diﬀerential expression, genes are ranked basedon the Bayes factor (BF), which is very general especially when considering diﬀerentialexpression under multiple factors; 4) Last but not least, GMNB avoids the normalizationpreprocessing step due to the explicit modeling of the sequencing depth in NB distribu-tions, as described in Dadaneh et al. [2017], and we expect similar superior performanceof GMNB compared to existing methods requiring such heuristic preprocessing steps.The remainder of the paper is organized as follows. Section 2 introduces the GMNBmodel, inference algorithm, and dynamic diﬀerential expression analysis. Section 3 com-pares the experimental results from both synthetic and real-world benchmark data usingGMNB and other state-of-the-art dynamic diﬀerential expression methods for temporalRNA-seq data. We conclude the paper in Section 4. Throughout this paper, we use the NB distribution to model RNA-seq read counts. Weparameterize a NB random variable as n ∼ NB( r, p ), where r is the nonnegative dispersionand p is the probability parameter. The probability mass function (pmf) of n is expressedas f N ( n ) = Γ( n + r ) n !Γ( r ) p n (1 − p ) r , where Γ( · ) is the gamma function. The NB random variable4 ∼ NB( r, p ) can be generated from a compound Poisson distribution: n = (cid:96) (cid:88) t =1 u t , u t ∼ Log( p ) , (cid:96) ∼ Pois( − r ln(1 − p )) , where u ∼ Log( p ) corresponds to the logarithmic random variable [Johnson et al., 2005],with the pmf f U ( u ) = − p u u ln(1 − p ) , u = 1 , , ... . As shown in Zhou and Carin [2015],given n and r , the distribution of (cid:96) is a Chinese Restaurant Table (CRT) distribution,( (cid:96) | n, r ) ∼ CRT( n, r ), a random variable from which can be generated as (cid:96) = (cid:80) nt =1 b t ,with b t ∼ Bernoulli( rr + t − ). We model the dynamic gene expression changes in a temporal RNA-seq dataset by con-structing a Markov chain where the expression of a gene at time t only depends on thatof time t −

1. Speciﬁcally, for the RNA-seq reads mapped to gene k in a given sample j under diﬀerent conditions, the read count at time t follows: n ( t ) kj ∼ NB( r ( t ) k , p ( t ) j ) , (1)where to impose the dependence between consecutive time points, we model the dis-persion parameters dynamically by introducing a gamma Markov chain, in which r ( t ) k isdistributed according to: r ( t ) k ∼ Gamma( r ( t − k , c k ) . (2)As previously shown in Dadaneh et al. [2017], the probability parameter p ( t ) j accountsfor the eﬀect of varying sequencing depth of sample j at time point t . More precisely, theexpected expression of gene k in sample j and time t is r ( t ) k p ( t ) j − p ( t ) j , and hence the dispersionparameter r ( t ) k can be viewed as the true abundance of gene k at time t , after removingthe eﬀects of sequencing depth. Thus the diﬀerential expression analysis of temporal5NA-seq data can be performed without any normalization preprocessing steps.Note that the scale parameter 1 /c k of the Gamma distribution in (2) is shared betweendiﬀerent time points, thereby making statistical inference more robust by borrowinginformation from various samples at multiple time points. To complete the model wesample the dispersion parameter at the ﬁrst time point as r (0) k ∼ Gamma( e (0) , f ), anduse conjugate priors as c k ∼ Gamma( c , d ) and p ( t ) j ∼ Beta( a , b ).In addition to the ﬂexibility of modeling temporal RNA-seq data, this GMNB modelenables an eﬃcient inference procedure by taking advantage of unique data augmenta-tion and marginalization techniques for the NB distribution [Zhou and Carin, 2015], asdescribed in detail below. By exploiting novel data augmentation techniques in Zhou and Carin [2015], we im-plement an eﬃcient Gibbs sampling algorithm with closed-form updating steps. Morespeciﬁcally, we infer the dispersion parameter of the NB distribution by ﬁrst drawinglatent random counts from the CRT distribution, and then update the dispersion by em-ploying the gamma-Poisson conjugacy. Furthermore, due to the Markovian constructionof the model, it is necessary to consider both backward and forward ﬂow of informationfor the inference of r ( t ) k . First, in the backward stage, starting from the last time point t = T , we draw two sets of auxiliary random variables as l ( t ) kj ∼ CRT( n ( t ) kj , r ( t ) k ) l ( t ) k. = (cid:88) j l ( t ) kj u ( t − t ) k ∼ CRT( u ( t )( t +1) k + l ( t ) k. , r ( t − k ) , (3)6or t = T, T − , . . . ,

1. For the last time point, we assume u ( T )( T +1) k = 0. Next, in theforward stage of Gibbs sampling, we sample r ( t ) k starting from t = 0 to t = T as( r ( t ) k |− ) ∼ Gamma (cid:0) r ( t − k + u ( t )( t +1) k + l ( t ) k. , θ ( t ) k (cid:1) , (4)where θ ( t ) k = c k − (cid:80) j ln(1 − p ( t ) j ) − ln (1 − q ( t ) k ). For t = 0 , ..., T − q ( t ) k is deﬁned as q ( t ) k = − (cid:80) j ln(1 − p ( t +1) j ) − ln(1 − q ( t +1) k ) c k − (cid:80) j ln(1 − p ( t +1) j ) − ln(1 − q ( t +1) k ) , (5)and q ( T ) k = 0. Finally, by taking advantage of conjugate priors, in each iteration of Gibbssampling, c k and p ( t ) j can be drawn as( c k |− ) ∼ Gamma( e + T − (cid:88) t =0 r ( t ) k , / ( f + T (cid:88) t =1 r ( t ) k )) , ( p ( t ) j |− ) ∼ Beta( a + (cid:88) k n ( t ) kj , b + (cid:88) k r ( t ) k ) . (6)The eﬃcient augmentation technique employed in our Gibbs sampling inference re-moves the need for specifying a suitable proposal distribution, as in the Metropolis-Hastings inference of both DyNB [ ¨Aij¨o et al., 2014] and NB-AR(1) methods [Oh et al.,2013]. Our experiments in the next section demonstrate that the Gibbs sampling algo-rithm of GMNB has fast convergence. The main goal of diﬀerential expression analysis is to identify the genes whose expressionsdemonstrate signiﬁcant variations across conditions. In the classic static RNA-seq dataanalysis, this goal is usually obtained via the comparison of expression averages acrossgroups. In dynamic RNA-seq measurement settings, however, this task becomes morechallenging as any change of temporal expression patterns between groups may reﬂect7nteresting biological mechanisms. Hence, as in ¨Aij¨o et al. [2014], we adopt the BayesFactor (BF) as a measure that exploits information collectively from all time pointsto detect the genes with signiﬁcant variations in temporal expression patterns acrossconditions.To compute the Bayes Factor, we ﬁrst consider the null hypothesis H that the genesare not diﬀerentially expressed across conditions, and thus the same set of parametersgovern the temporal gene expressions. In this case, we aggregate the counts D of bothexperimental conditions to ﬁt the GMNB model M . On the other hand, under thealternative hypothesis H , the diﬀerentially expressed genes possess diﬀerent model pa-rameters in each group. Hence, GMNB models M and M are independently ﬁtted to thecounts in conditions 1 ( D ) and 2 ( D ), respectively. Then, the BF can be calculated asBF = P ( D| H ) P ( D| H ) = P ( D | M ) P ( D | M ) P ( D| M ) , where we have assumed equal prior probabilities for both hypotheses. The BF compu-tation requires marginalizing out model parameters, which we conduct through MonteCarlo integration using posterior samples collected in the Gibbs sampling procedure. We evaluate the proposed GMNB model and compare its performance on both syntheticand real-world temporal RNA-seq data with DyNB [ ¨Aij¨o et al., 2014]. We also considerDESeq2 [Love et al., 2014], which is a popular tool for diﬀerential expression analysis,however, not speciﬁcally designed for temporal RNA-seq data. We ﬁrst consider syntheticRNA-seq data generated by diﬀerent temporal models, and show that GMNB consistentlyprovides outstanding performance in terms of the area under the curves (AUCs) of receiveroperating characteristic (ROC) and precision-recall (PR) curves. Furthermore, we presenttwo case studies on human Th17 cell diﬀerentiation [Tuomela et al., 2016, Chan et al.,8016, ¨Aij¨o et al., 2014], and explain the biomedical implications based on diﬀerentialexpression analysis over time by GMNB.Throughout the experimental studies for synthetic and real-world data, for GMNB,in each run of Gibbs sampling inference 1000 MCMC samples of parameters are collectedafter 1000 burn-in iterations. We use the collected MCMC samples to calculate the BFfor each gene as explained in Section 2.4, and rank the genes according to these BFs.For DyNB, we follow the settings provided in ¨Aij¨o et al. [2014] and rank the genes usingthe computed BFs. We consider three diﬀerent setups for diﬀerential expression analysisof temporal RNA-seq data using DESeq2. In the ﬁrst setup, denoted by DESeq2-GLMin the experiments, time information is incorporated as a covariate of the generalizedlinear model in DESeq2 in diﬀerential expression analysis to determine temporal data inone model. In the second and third setups, we apply DESeq2 to the data at diﬀerenttime points independently, and use the average and minimum computed p-values fromthe respective diﬀerential expression analyses as an overall measure of diﬀerential ex-pression across conditions, denoted by DESeq2-avg and DESeq2-min in the experiments,respectively.It is worth mentioning that the use of an eﬃcient closed-form Gibbs sampling makesGMNB, on average, 10 times faster than DyNB for both simulated and real-world tem-poral RNA-seq datasets by reducing the number of iterations required to converge. Thisis due to the low acceptance rate of the Metropolis-Hastings step of DyNB inference.Thus, to ensure the convergence of its MCMC inference, we consider performing 100 , We ﬁrst perform a comprehensive evaluation of GMNB with the synthetic data generatedunder diﬀerent temporal RNA-seq models. More precisely, we simulat the data under the9ollowing three diﬀerent setups: the proposed GMNB generative model, the DyNB gen-erative model [ ¨Aij¨o et al., 2014], and the auto-regressive (AR) based procedure [Oh et al.,2013]. In all setups 10% of genes are randomly set to be truly diﬀerentially expressed,with the procedure described in detail for each setup in the following subsections. Foreach speciﬁc generative model, we change the corresponding model parameters to ensurethat the expected expression changes of truly diﬀerentially expressed genes are diﬀerentacross two conditions. The impact of sequencing depth variation is simulated by drawingthe corresponding size factors from the interval [0 . , .

2] uniformly at random.

In the ﬁrst simulation study, we generate the synthetic RNA-seq count data for 1000genes under two conditions according to the GMNB model (1) with the gamma-Markovtemporal dependencies (2) between dispersion parameters. The gene-wise scale parame-ters c k are drawn from the uniform distribution in the interval [0 . , c k + b , where b =  .

02 if c k < − .

02 if c k ≥ r (0) k , is generated for both conditions according toGamma( e ,

10) where e = Uniform(30 , . , . Recall P rec i s i o n GMNB, AUC = 0.65DyNB, AUC = 0.27DESeq2-min, AUC = 0.57DESeq2-avg, AUC = 0.46DESeq2-GLM, AUC = 0.09 0 0.2 0.4 0.6 0.8 1

False Positive Rate T r u e P o s i t i v e R a t e GMNB, AUC = 0.87DyNB, AUC = 0.70DESeq2-min, AUC = 0.84DESeq2-avg, AUC = 0.80DESeq2-GLM, AUC = 0.46

Figure 1:

Left column:

PR Curve,

Right column:

ROC Curve. Performance compar-ison of diﬀerent methods for diﬀerential gene expression over time based on the GMNBgenerative model. AUCs are given in the corresponding legends in the plots.

In the second simulation study, data is generated according to the DyNB model assump-tions. More speciﬁcally, we draw the true mean values µ k , for 1000 genes from a Gaussianprocess with the mean m k and the covariance matrix Cov ( t i , t j ) = θ k exp( − α k | t i − t j | ),where m k , θ k and α k are uniformly distributed in the intervals [1000 , , . , m k and covari-ance function parameters { θ k , α k } to { bm k , cθ k , α k ± d } , where b = 1 . c = 10, and d = 0 .

25 determine the signiﬁcance of expected expression changes across conditions.Similar to the previous simulation setup, 4 replicates are generated for each time pointin the corresponding condition.Figure 2 demonstrates the performances of diﬀerent methods applied to the datagenerated according to the above procedure. GMNB still clearly outperforms the othermethods based on both ROC and PR curves. The inferior performance of DyNB may bedue to the small number of replicates for each time point, leading to poor estimates ofboth θ k (heuristically estimated by the data dependent value 10 × stdev( Y ) based on theobserved replicates Y = { y , . . . , y J } in ¨Aij¨o et al. [2014]) and µ k (heuristically estimatedby min ( Y )+ max ( Y )2 in ¨Aij¨o et al. [2014]). On the contrary, the fully Bayesian nature ofGMNB makes its performance robust to the number of replicates as well as potentialnoise at each time point. Both variates of static DESeq2 under-perform the dynamic ap-proaches remarkably, as they neglect the correlation between samples across time points.In addition, DESeq2-GLM under-performs both GMNB and DyNB substantially, as itneglects the inherent dynamics of RNA-seq experiments speciﬁcally. To further demon-strate the potential inﬂuence of gene expression levels, reﬂected by the expected readcounts, on the detection power of diﬀerentially expressed genes, additional simulationsare performed with the same parameters as above, except for the mean parameter m k ,for which three sampling uniform distributions are tested with the intervals [1000 , , , Recall P rec i s i o n GMNB, AUC = 0.78DyNB, AUC = 0.71DESeq-min, AUC = 0.52DESeq-avg, AUC = 0.59DESeq-GLM, AUC = 0.16 0 0.2 0.4 0.6 0.8 1

False Positive Rate T r u e P o s i t i v e R a t e GMNB, AUC = 0.92DyNB, AUC = 0.89DESeq-min, AUC = 0.86DESeq-avg, AUC = 0.85DESeq-GLM, AUC = 0.57

Figure 2:

Left column:

PR Curve,

Right column:

ROC Curve. Performance compar-ison of diﬀerent methods for diﬀerential gene expression over time based on the DyNBgenerative model. AUCs are given in the corresponding legends in the plots.no matter when we have low or high level counts. It is also noticeable that DyNB andDESeq2-min are more sensitive with variable performances across 20 randomly generateddatasets. This indicates that GMNB better borrows signal strengths across time pointscompared to DyNB and DESeq2-min.

In addition to synthetic data based on the GMNB and DyNB models, we evaluate thesemethods with the simulated data based on the NB-AR(1) model [Oh et al., 2013]. Moreprecisely, the count for gene k at time t is distributed according to a NB distribution whosemean parameter satisﬁes log( µ ( t ) k ) = ω ( t ) k + β k . Here β k follows the uniform distributionin [4 . , .

5] to test the temporal diﬀerential expression performance with low read counts.The parameter ω ( t ) k is obtained through an auto-regressive process φ k ω ( t − k + (cid:15) ( t ) , where φ k is randomly generated from the uniform distribution in [0 . , . (cid:15) ( t ) is a standard13ero-mean white noise process. Similar to the previous two simulation models, read countsare generated for 1000 genes and 10% of them selected to be diﬀerentially expressed bychanging the parameter φ k to bφ k for the second condition, where b =  / φ k ≤ . / φ k > . φ k that controls the temporal dependence of gene expression. However, thetemporal correlation assumptions of the Gaussian process, diﬀerent from this generativemodel, makes it less powerful to identify all diﬀerentially expressed genes. The resultsin Figure 3 demonstrate the higher power of GMNB in detecting temporal diﬀerentialexpression patterns, especially with low expression levels (read counts are approximately150 here). Similar to the DyNB generative model, we compare the performance of GMNBwith DyNB and DESeq2-min (top performing methods) with diﬀerent expected counts. InFigure 4, additional simulations are performed with the same parameters as above, exceptthe uniform distribution for the β k with three varying intervals: [4 . , . . , . . , . Recall P rec i s i o n GMNB, AUC = 0.52DyNB, AUC = 0.37DESeq2-min, AUC = 0.43DESeq2-avg, AUC = 0.39DESeq2-GLM, AUC = 0.28 0 0.2 0.4 0.6 0.8 1

False Positive Rate T r u e P o s i t i v e R a t e GMNB, AUC = 0.84DyNB, AUC = 0.66DESeq2-min, AUC = 0.80DESeq2-avg, AUC = 0.75DESeq2-GLM, AUC = 0.69

Figure 3:

Left column:

PR Curve,

Right column:

ROC Curve. Performance compari-son of diﬀerent methods for diﬀerential gene expression over time based on the NB-AR(1)generative model. AUCs are given in the corresponding legends in the plots.erative models, which can have abrupt non-smooth changes. In addition, the heuristicestimation of model parameters adopted in DyNB [ ¨Aij¨o et al., 2014] when the number ofreplicates is low can be the other reason for the degraded performance.In summary, on synthetic RNA-seq count data from diﬀerent generative models, com-parison of both the ROC and PR curves shows that GMNB outperforms both the recentlyproposed temporal (DyNB) and static diﬀerential analysis methods that aggregate dif-ferential statistics in heuristic ways (DESeq2 with diﬀerent setups). Table 1 summarizesthe average AUCs and their standard deviation values of both ROC and PR curves for20 randomly generated synthetic datasets by the top three performing methods (GMNB,DyNB, and DESeq2-min). GMNB improves the performances of DyNB and DESeq2-min,in terms of AUC-PR, at least by 23% and 17%, respectively. In the best case scenario,GMNB improves the AUC-PR performances of DyNB and DESeq2-min up to 48% and71%, respectively. In terms of AUC-ROC, GMNB improves the best case performances15able 1: Comparison of AUCs based on 20 runs for each method.

AUC Generative Model GMNB DyNB DESeq2-minGMNB 0.84 ± ± ± ROC DyNB 0.94 ± ± ± NB-AR(1) 0.81 ± ± ± GMNB 0.61 ± ± ± PR DyNB 0.79 ± ± ± NB-AR(1) 0.51 ± ± ± To further illustrate how GMNB may help identify diﬀerentially expressed genes fromtemporal RNA-seq data for biologically signiﬁcant results, we provide such a case studyconsisting of 57 human samples during the priming of T helper 17 (Th17) cell diﬀer-entiation [Tuomela et al., 2012]. The main goal of designing this case study is to gaininsights into the diﬀerentiation process by unraveling dependency between diﬀerent ge-netic factors in various pathways, which may serve as potential biomarkers of immuno-logical diseases for therapeutic intervention design. In this dataset [Tuomela et al., 2016],at 0 , . , , , , , , ,

48, and 72 hours of Th17 polarized cells and control Th0 cells,three biological replicates were collected for transcript proﬁling by RNA-seq. The data16

500 600 150

AUC − P R . . . . . . GMNBDyNBDESeq2−min

AUC − R O C . . . . . . DyNB generative model

450 250 150

AUC − P R . . . . . .

450 250 150

AUC − R O C . . . . . . NB−AR(1) generative model

Figure 4: AUC comparison of diﬀerent methods for diﬀerential gene expression analysisover time in low counts.were downloaded from Gene Omnibus with the accession number GSE52260 [Tuomelaet al., 2016, Chan et al., 2016].When checking the 10 most diﬀerentially expressed genes based on their BFs byGMNB, all of them have been reported to be diﬀerentially expressed in other studiesinvestigating Th17 cell diﬀerentiation. Among them, the top diﬀerentially expressed geneis thrombospondin-1 (TSP1), whose encoded protein participates in the diﬀerentiationof Th17 cells by activating transforming growth factor beta (TGF- β ) and enhancingthe inﬂammatory response in experimental autoimmune encephalomyelitis (EAE) [Yang17t al., 2009]. The second gene in the list is Lymphotoxin α (LTA), a member of the tumornecrosis factor (TNF) superfamily that is both secreted and expressed on the cell surfaceof activated Th17 cells [Chiang et al., 2009]. The third gene, COL6A3, contributes toadipose tissue inﬂammation [Pasarica et al., 2009] and responds quickly to Th17 cellpolarizing stimulation [Tripathi et al., 2017]. The gene Cathepsin L (CTSL1) is rankedas the fourth in the list and is linked to the regulation of immune responses at the level ofMHC complex maturation and Ag presentation inﬂuencing diﬀerentiation of CD4+ cellsand autoimmune reactions [Reiser et al., 2010]. The ﬁfth gene, FURIN, has been reportedas a T cell activation gene that regulates the T helper cell balance of the immune system[Pesu et al., 2008]. The sixth gene lamin A (LMNA) has been identiﬁed as one of theimmune response regulators [Gonz´alez-Granado et al., 2014]. The seventh gene, FilaminA (FLNA), is required for T cell activation [Hayashi and Altman, 2006]. The eighthgene, SBNO2, has been reported to inﬂuence Th17 cell diﬀerentiation [Tripathi et al.,2017]. Zhao et al. [2014] observed signiﬁcant changes in the expression of the ninth geneACTB in activated T cells. Finally, the tenth gene Notch1 is activated in both mouseand human in vitro-polarized Th17 cells and also in Th17 polarized cells as comparedwith Th0 control cells [Keerthivasan et al., 2011].We then investigate how the results of DyNB diﬀer from those of GMNB. The majorityof the above genes are indeed ranked relatively high by DyNB as diﬀerentially expressed,except two genes: FLNA and ACTB. For these two genes, their expression levels changeabruptly after 12 hours of T17 diﬀerentiation. These two genes demonstrate that theDyNB method may fail to detect temporal diﬀerential expression when the temporalgene expression trends are not smooth. As an instance, Figure 5 illustrates that DyNBis not able to capture the temporal expression changes of gene FLNA accurately. Moreprecisely, Figure 5(a) shows the posterior means of expected gene expression µ k based onDyNB and their corresponding conﬁdence intervals, where circles and diamonds representthe normalized counts from Th0 and Th17 lineages, respectively. To further assess the18ower of the models in reproducing the observed gene counts, for each model, we generate1000 gene counts per sample and time point based on the inferred parameters, andthen calculate the 99% conﬁdence interval using these synthetically generated counts.Figure 5(b) demonstrates the means and conﬁdence intervals of the counts generated viathis procedure for DyNB, where the circles and diamonds represent the observed rawcounts from Th0 and Th17 lineages, respectively. Similar to plots in Figures 5(a) and5(b), we perform the same examinations on expression pattern of FLNA by the GMNBmodel. To demonstrate the expression levels of the k th gene between two groups, theDyNB compares the posterior NB mean parameters µ k , whereas the GMNB compares theposterior NB shape parameters r k . One may consider that the expression level of gene k is assumed to roughly follow a function of the shape parameter r k in the GMNB, but theobserved counts should be demonstrated in a same scale as the shape parameter. Thediﬀerence between the posterior shape parameters r k explains the diﬀerences betweenthe means, since if n ( t ) kj ∼ NB( r ( t ) k , p ( t ) j ), then E [ n ( t ) kj ] = r ( t ) k p ( t ) j / (1 − p ( t ) j ). Therefore,Figure 5(c) shows the posterior means of r k based on GMNB and their correspondingconﬁdence intervals, where the circles and diamonds are obtained by dividing the observedcounts by the parameter p ( t ) j / (1 − p ( t ) j ) representing the sequencing depth in the proposedmodel. Additionally, Figure 5(d) demonstrates the means and conﬁdence intervals forsynthetically generated gene counts based on the inferred parameters of GMNB, wherethe read counts on the y-axis are observed read counts. Not only GMNB improves themodel ﬁtting over 24h to 72h, but also it has more robust estimation of expression patternsfor the starting time points with lower counts (Figures 5(c) and 5(d)). The calculatedBFs for the gene FLNA are 2.3461 and 1 . × by DyNB and GMNB, respectively.GMNB also identiﬁes ACTB as a gene with signiﬁcant diﬀerential temporal expression(BF >

10) but DyNB again fails to capture the abrupt expression changes and therebyassociates low BF (supplement materials). The corresponding temporal expression plotsare depicted in Figure S1 of the supplement materials.19 a) (b)(c) (d)

Figure 5:

Diﬀerentially expressed gene FLNA detected by GMNB but not byDyNB. (a) The normalized gene expression proﬁle of

FLNA over time estimated byDyNB model. The normalization of read counts on the y-axis are obtained by usingthe normalization method of DESeq. The solid blue and red curves are the posteriormeans under Th0 and Th17 lineages, respectively, with corresponding 99% CIs (shadedareas). (b) The gene expression proﬁle of

FLNA over time estimated by DyNB. Theread counts on the y-axis are observed read count. The solid blue and red curves arethe means of the generated samples based on the inferred parameters by DyNB underTh0 and Th17 lineages, respectively, with corresponding 99% CIs (shaded areas aroundmeans). (c) The normalized gene expression proﬁle of

FLNA over time estimated by theproposed GMNB model. The normalization of read counts on the y-axis are obtained bydividing the observed counts by the parameter p ( t ) j / (1 − p ( t ) j ) representing the sequencingdepth in the model. The solid blue and red curves are posterior means of r k under Th0and Th17 lineages, respectively, with corresponding 99% CIs (shaded areas). (d) Thegene expression proﬁle of FLNA over time estimated by GMNB. The read counts onthe y-axis are observed read count. The solid blue and red curves are the means of thegenerated samples based on the inferred parameters by GMNB under Th0 and Th17lineages, respectively, with corresponding 99% CIs (shaded areas around means).20n the other hand, LGALS1, SEPT5, BATF3, COL1A2, and ENO2 are ﬁve genesout of 90 diﬀerentially expressed genes detected by DyNB with BFs 2 . × , 472 . . × , 404 .

34, and 398 .

43, whereas they are associated with BFs lower than 10 byGMNB. Figure 6 illustrates the expression proﬁle of the gene LGALS1 inferred by DyNBand GMNB, indicating that DyNB is not able to ﬁlter out those low count genes forwhich the replicated Th0 and Th17 lineages are seemingly similar and leads to this po-tential false positive. Figure 6(a) shows the posterior means of expected gene expression µ k based on DyNB under Th0 and Th17 lineages with their corresponding conﬁdence in-tervals. Figure 6(b) shows the means and conﬁdence intervals for 1000 generated samplesbased on the inferred parameters of DyNB model. While the normalized counts are plot-ted in Figure 6(a), the circles and diamonds mark Th0 and Th17 lineages, respectively,for the observed raw counts in Figure 6(b). On the contrary, GMNB considers this genenot signiﬁcantly diﬀerentially expressed with similar inferred temporal expression proﬁlesacross conditions, as demonstrated in Figures 6(c) and 6(d). This may be explained bythe fact that GMNB employs a fully generative model of gene expressions, including thesequencing depth, while DyNB uses a deterministic ad-hoc procedure to normalize genecounts, and thus neglecting the uncertainty over the sequencing depth when computingthe BF, leading to potential false positives. Figures 6(c) and 6(d) demonstrate the pos-terior means of r k based on GMNB and the means of synthetically generated samplesbased on the inferred parameters of the proposed model, respectively. Similar to theplots for LGALS1, Figures S8, S9, S10, and S11 in the supplement materials show thesimilar trends for the genes SEPT5, BATF3, COL1A2, and ENO2 based on the resultsby DyNB and GMNB.In order to further demonstrate the advantages of GMNB, the overlap of three ap-proaches (GMNB, DyNB and DESeq2-min), for 100 top diﬀerentially expressed genesidentiﬁed by GMNB, is depicted as a Venn diagram in Figure 7. A gene is diﬀeren-tially expressed based on DESeq2-min if the corresponding p-value < .

05 at any time21 a) (b)(c) (d)

Figure 6:

Example of genes detected as diﬀerentially expressed by DyNB butnot by GMNB: LGALS1. (a) The normalized gene expression proﬁle of

LGALS1 over time estimated by DyNB model. The normalization of read counts on the y-axis areobtained by using the normalization method of DESeq. The solid blue and red curves arethe posterior means under Th0 and Th17 lineages, respectively, with corresponding 99%CIs (shaded areas). (b) The gene expression proﬁle of

LGALS1 over time estimatedby DyNB. The read counts on the y-axis are observed read count. The solid blue andred curves are the means of the generated samples based on the inferred parameters byDyNB under Th0 and Th17 lineages, respectively, with corresponding 99% CIs (shadedareas around means). (c) The normalized gene expression proﬁle of

LGALS1 over timeestimated by the proposed GMNB model. The normalization of read counts on the y-axisare obtained by dividing the observed counts by the parameter p ( t ) j / (1 − p ( t ) j ) representingthe sequencing depth in the model. The solid blue and red curves are posterior meansof r k under Th0 and Th17 lineages, respectively, with corresponding 99% CIs (shadedareas). (d) The gene expression proﬁle of LGALS1 over time estimated by GMNB. Theread counts on the y-axis are observed read count. The solid blue and red curves arethe means of the generated samples based on the inferred parameters by GMNB underTh0 and Th17 lineages, respectively, with corresponding 99% CIs (shaded areas aroundmeans). 22oint. Out of top 100 diﬀerentially expressed genes identiﬁed by GMNB (log(BF) > α that is critical to drive Th17 diﬀerentiation[Corcoran and ONeill, 2016]. EGR2 has been identiﬁed as an important transcriptionfactor in the development and function of Th17 cells [Zhang et al., 2015, Zhu et al., 2008].IL6ST is known as a signature transcript of Th17 cells [Ghoreschi et al., 2010]. This againillustrates the beneﬁts of GMNB on better modeling temporal dynamic changes to detectbiologically meaningful genes who show signiﬁcant diﬀerence in temporal changes but donot show signiﬁcant diﬀerential expression when studying them at individual time points. To further demonstrate the biological relevance of the detected genes by GMNB, GOanalysis of top 100 diﬀerentially expressed genes (log(BF) > DESeq2−minGMNB DyNB

Figure 7: A Venn diagram representing the overlaps of the top 100 diﬀerentially expressedgenes detected by GMNB with DyNB and DESeq2-min.process. The most signiﬁcantly enriched GO terms are related to the organ develop-ment (p-value < × − ), immune system process (p-value < × − ), immuneresponse (p-value < × − ), response to stimulus (p-value < − ), cell diﬀerentiation(p-value < × − ), and defense response (p-value < × − ). In particular, 38% and74% of these 100 genes are annotated to immune response and response to stimulus, re-spectively, supported by the central role of Th17 cells in the pathogenesis of autoimmuneand inﬂammatory diseases [Waite and Skokos, 2011]. We further analyze the second temporal RNA-seq dataset, for which DyNB was imple-mented for studying Th17 cell lineage [ ¨Aij¨o et al., 2014]. In this dataset, CD4+ T cellswere activated and polarized as described in Tuomela et al. [2012] and RNA-seq datawere collected at 0, 12, 24, 48 and 72 hours of both the activation (Th0) and diﬀerenti-24able 2: Comparison of BF ranks for the reported genes by DyNB

Genes DyNB GMNBRORC

37 (BF = 2 . × ) (BF = 2 . × ) IL17F

352 (BF = 1 . × ) (BF = 2 . × ) IL17A

755 (BF = 6 . × ) (BF = 3 . × )ation (Th17). At each time point, there are 3 biological replicates for both cell lineages.The original paper [ ¨Aij¨o et al., 2014] performed DyNB to quantify Th17-speciﬁc geneexpression dynamics.The authors in ¨Aij¨o et al. [2014] ﬁrst normalized the RNA-seq counts by the DESeqpipeline [Anders and Huber, 2010]. Then, DyNB was applied to the normalized expressionvalues to identify diﬀerentially expressed genes between the Th0 and Th17 lineages.Genes were considered diﬀerentially expressed if (i) BF >

10, and (ii) fold-change > IL A , IL F , and RORC .We apply GMNB to analyze the same Th17 cell lineage dataset to identify diﬀeren-tially expressed genes. To compare the ranked lists of genes by GMNB and by DyNBrespectively, Table 1 gives the ranks as well as the computed BF values by GMNB andDyNB for these reported genes in ¨Aij¨o et al. [2014]. These qRT-PCR validated genes arein fact ranked higher by GMNB, indicating more promising potential for marker geneidentiﬁcation.

GMNB oﬀers a comprehensive and fully Bayesian solution to study temporal RNA-seqdata. The most notable advantage is the capacity to capture a broad range of gene ex-pression patterns over time by the integration of a gamma Markov chain into a negativebinomial distribution model. This allows GMNB to oﬀer consistent performance over25iﬀerent generative models and makes it be robust for studies with diﬀerent numbers ofreplicates by borrowing the statistical strength across both genes and samples. Anothercritical advantage is the eﬃcient closed-form Gibbs sampling inference of the model pa-rameters, which improves the computational complexity compared to the state-of-the-artmethods. This is achieved by using a statistically well-founded data augmentation solu-tion. In addition, GMNB explicitly models the potential sequencing depth heterogeneityso that no heuristic preprocessing step is required. Experimental results on both syn-thetic and real-world RNA-seq data demonstrate the state-of-the-art performance of theGMNB method for temporal diﬀerential expression analysis of RNA-seq data.

This research was supported in part by the NSF Awards CCF-1553281, CCF-1718513,and the USDA NIFA Award 06-505570-01006 to X. Qian and the NSF Award DBI-1532188 and the funding support from QNRF (NPRP9-001-2-001, NPRP7-1634-2-604),CSTR (2017-01), and CONACYT to P. de Figueiredo.

References

A. Acharya, J. Ghosh, and M. Zhou. Nonparametric bayesian factor analysis for dynamiccount matrices. In

Artiﬁcial Intelligence and Statistics , pages 1–9, 2015.T. ¨Aij¨o, V. Butty, Z. Chen, V. Salo, S. Tripathi, C. B. Burge, R. Lahesmaa, andH. L¨ahdesm¨aki. Methods for time series analysis of rna-seq data with applicationto human th17 cell diﬀerentiation.

Bioinformatics , 30(12):i113–i120, 2014.S. Anders and W. Huber. Diﬀerential expression analysis for sequence count data.

Genome biology , 11(10):R106, 2010.S. Anders, P. T. Pyl, and W. Huber. Htseqa python framework to work with high-throughput sequencing data.

Bioinformatics , 31(2):166–169, 2015.Z. Bar-Joseph, A. Gitter, and I. Simon. Studying and modelling dynamic biologicalprocesses using time-series gene expression data.

Nature reviews. Genetics , 13(8):552,2012. 26. H. Chan, J. Intosalmi, S. Rautio, and H. L¨ahdesm¨aki. A subpopulation model toanalyze heterogeneous cell diﬀerentiation dynamics.

Bioinformatics , 32(21):3306–3313,2016.E. Y. Chiang, G. A. Kolumam, X. Yu, M. Francesco, S. Ivelja, I. Peng, P. Gribling,J. Shu, W. P. Lee, C. J. Reﬁno, et al. Targeted depletion of lymphotoxin- α –expressingth1 and th17 cells inhibits autoimmune disease. Nature medicine , 15(7):766–773, 2009.E. P. Consortium et al. An integrated encyclopedia of dna elements in the human genome.

Nature , 489(7414):57, 2012.S. E. Corcoran and L. A. ONeill. Hif1 α and metabolic reprogramming in inﬂammation. The Journal of clinical investigation , 126(10):3699–3707, 2016.S. Z. Dadaneh, X. Qian, and M. Zhou. Bnp-seq: Bayesian nonparametric diﬀerentialexpression analysis of sequencing count data.

Journal of the American Statistical As-sociation , (just-accepted), 2017.Y. Doi, S. Oki, T. Ozawa, H. Hohjoh, S. Miyake, and T. Yamamura. Orphan nuclearreceptor nr4a2 expressed in t cells from multiple sclerosis mediates production of in-ﬂammatory cytokines.

Proceedings of the National Academy of Sciences , 105(24):8381–8386, 2008.M. S. Fassett, W. Jiang, A. M. D’Alise, D. Mathis, and C. Benoist. Nuclear receptor nr4a1modulates both regulatory t-cell (treg) diﬀerentiation and clonal deletion.

Proceedingsof the National Academy of Sciences , 109(10):3891–3896, 2012.K. Ghoreschi, A. Laurence, X.-P. Yang, C. M. Tato, M. J. McGeachy, J. E. Konkel, H. L.Ramos, L. Wei, T. S. Davidson, N. Bouladoux, et al. Generation of pathogenic th17cells in the absence of tgf-[bgr] signalling.

Nature , 467(7318):967–971, 2010.J. Gnanaprakasam and R. Wang. Myc in regulating immunity: metabolism and beyond.

Genes , 8(3):88, 2017.J. M. Gonz´alez-Granado, C. Silvestre-Roig, V. Rocha-Perugini, L. Trigueros-Motos,D. Cibri´an, G. Morlino, M. Blanco-Berrocal, F. G. Osorio, J. M. P. Freije, C. L´opez-Ot´ın, et al. Nuclear envelope lamin-a couples actin dynamics with immunologicalsynapse architecture and t cell activation.

Science signaling , 7(322):ra37, 2014.T. J. Hardcastle and K. A. Kelly. bayseq: empirical bayesian methods for identifyingdiﬀerential expression in sequence count data.

BMC bioinformatics , 11(1):422, 2010.K. Hayashi and A. Altman. Filamin a is required for t cell activation mediated by proteinkinase c- θ . The Journal of Immunology , 177(3):1721–1728, 2006.N. L. Johnson, A. W. Kemp, and S. Kotz.

Univariate discrete distributions , volume 444.John Wiley & Sons, 2005. 27. Keerthivasan, R. Suleiman, R. Lawlor, J. Roderick, T. Bates, L. Minter, J. Anguita,I. Juncadella, B. J. Nickoloﬀ, I. C. Le Poole, et al. Notch signaling regulates mouseand human th17 diﬀerentiation.

The Journal of Immunology , 187(2):692–701, 2011.Y. Kurebayashi, S. Nagai, A. Ikejiri, M. Ohtani, K. Ichiyama, Y. Baba, T. Yamada,S. Egami, T. Hoshii, A. Hirao, et al. Pi3k-akt-mtorc1-s6k1/2 axis controls th17 diﬀer-entiation by regulating gﬁ1 expression and nuclear translocation of ror γ . Cell reports ,1(4):360–373, 2012.C. W. Law, Y. Chen, W. Shi, and G. K. Smyth. Voom: precision weights unlock linearmodel analysis tools for rna-seq read counts.

Genome biology , 15(2):R29, 2014.N. Leng, J. A. Dawson, J. A. Thomson, V. Ruotti, A. I. Rissman, B. M. Smits, J. D.Haag, M. N. Gould, R. M. Stewart, and C. Kendziorski. Ebseq: an empirical bayeshierarchical model for inference in rna-seq experiments.

Bioinformatics , 29(8):1035–1043, 2013.N. Leng, Y. Li, B. E. McIntosh, B. K. Nguyen, B. Duﬃn, S. Tian, J. A. Thomson,C. N. Dewey, R. Stewart, and C. Kendziorski. Ebseq-hmm: a bayesian approach foridentifying gene-expression changes in ordered rna-seq experiments.

Bioinformatics ,31(16):2614–2622, 2015.J. Lienau, K. Schmidt-Bleek, A. Peters, H. Weber, H. J. Bail, G. N. Duda, C. Perka,and H. Schell. Insight into the molecular pathophysiology of delayed bone healing in asheep model.

Tissue Engineering Part A , 16(1):191–199, 2009.M. I. Love, W. Huber, and S. Anders. Moderated estimation of fold change and dispersionfor rna-seq data with deseq2.

Genome biology , 15(12):550, 2014.P. McCullagh. Generalized linear models.

European Journal of Operational Research , 16(3):285–292, 1984.A. Mortazavi, B. A. Williams, K. McCue, L. Schaeﬀer, and B. Wold. Mapping andquantifying mammalian transcriptomes by rna-seq.

Nature methods , 5(7):621–628,2008.M. J. Nueda, S. Tarazona, and A. Conesa. Next masigpro: updating masigpro biocon-ductor package for rna-seq time series.

Bioinformatics , 30(18):2598–2602, 2014.S. Oh, S. Song, G. Grabowski, H. Zhao, and J. P. Noonan. Time series expression analysesusing rna-seq: a statistical approach.

BioMed research international , 2013, 2013.M. Pasarica, B. Gowronska-Kozak, D. Burk, I. Remedios, D. Hymel, J. Gimble,E. Ravussin, G. A. Bray, and S. R. Smith. Adipose tissue collagen vi in obesity.

The Journal of Clinical Endocrinology & Metabolism , 94(12):5155–5162, 2009.M. Pesu, W. T. Watford, L. Wei, L. Xu, I. Fuss, W. Strober, J. Andersson, E. Shevach,M. Quezado, A. Roebroek, et al. T cell-expressed proprotein convertase furin is essentialfor maintenance of peripheral tolerance.

Nature , 455(7210):246, 2008.28. E. Rasmussen and C. K. Williams.

Gaussian processes for machine learning , volume 1.MIT press Cambridge, 2006.J. Reiser, B. Adair, and T. Reinheckel. Specialized roles for cysteine cathepsins in healthand disease.

The Journal of clinical investigation , 120(10):3421, 2010.M. D. Robinson, D. J. McCarthy, and G. K. Smyth. edger: a bioconductor package fordiﬀerential expression analysis of digital gene expression data.

Bioinformatics , 26(1):139–140, 2010.S. Sawcer, G. Hellenthal, M. Pirinen, C. C. Spencer, N. A. Patsopoulos, L. Moutsianas,A. Dilthey, Z. Su, C. Freeman, S. E. Hunt, et al. Genetic risk and a primary role forcell-mediated immune mechanisms in multiple sclerosis.

Nature , 476(7359):214, 2011.A. Schein, H. Wallach, and M. Zhou. Poisson-gamma dynamical systems. In

Advancesin Neural Information Processing Systems , pages 5005–5013, 2016.D. Spies and C. Ciaudo. Dynamics in transcriptomics: advancements in rna-seq timecourse and downstream analysis.

Computational and structural biotechnology journal ,13:469–477, 2015.X. Sun, D. Dalpiaz, D. Wu, J. S. Liu, W. Zhong, and P. Ma. Statistical inference for timecourse rna-seq data using a negative binomial mixed-eﬀect model.

BMC bioinformatics ,17(1):324, 2016.S. K. Tripathi, Z. Chen, A. Larjo, K. Kanduri, K. Nousiainen, T. ¨Aijo, I. Rica˜no-Ponce,B. Hrdlickova, S. Tuomela, E. Laajala, et al. Genome-wide analysis of stat3-mediatedtranscription during early human th17 cell diﬀerentiation.

Cell Reports , 19(9):1888–1901, 2017.S. Tuomela, V. Salo, S. K. Tripathi, Z. Chen, K. Laurila, B. Gupta, T. ¨Aij¨o, L. Oikari,B. Stockinger, H. L¨ahdesm¨aki, et al. Identiﬁcation of early gene expression changesduring human th17 cell diﬀerentiation.

Blood , 119(23):e151–e160, 2012.S. Tuomela, S. Rautio, H. Ahlfors, V. ¨Oling, V. Salo, U. Ullah, Z. Chen, S. H¨am¨alist¨o,S. K. Tripathi, T. ¨Aij¨o, et al. Comparative analysis of human and mouse transcriptomesof th17 cell priming.

Oncotarget , 7(12):13416, 2016.J. C. Waite and D. Skokos. Th17 response and inﬂammatory autoimmune diseases.

International journal of inﬂammation , 2012, 2011.Z. Wang, M. Gerstein, and M. Snyder. Rna-seq: a revolutionary tool for transcriptomics.

Nature reviews genetics , 10(1):57–63, 2009.K. Yang, J. L. Vega, M. Hadzipasic, J. P. S. Peron, B. Zhu, Y. Carrier, S. Masli, L. V.Rizzo, and H. L. Weiner. Deﬁciency of thrombospondin-1 reduces th17 diﬀerentiationand attenuates experimental autoimmune encephalomyelitis.

Journal of autoimmunity ,32(2):94–103, 2009. 29. Yosef, A. K. Shalek, J. T. Gaublomme, H. Jin, Y. Lee, A. Awasthi, C. Wu, K. Karwacz,S. Xiao, M. Jorgolli, et al. Dynamic regulatory network controlling th17 cell diﬀeren-tiation.

Nature , 496(7446):461, 2013.M. Zhang, Y. Wang, J.-S. Wang, J. Liu, M.-M. Liu, and H.-B. Yang. The roles of egr-2in autoimmune diseases.

Inﬂammation , 38(3):972–977, 2015.S. Zhao, W.-P. Fung-Leung, A. Bittner, K. Ngo, and X. Liu. Comparison of rna-seq andmicroarray in transcriptome proﬁling of activated t cells.

PloS one , 9(1):e78644, 2014.M. Zhou and L. Carin. Negative binomial process count and mixture modeling.

IEEETransactions on Pattern Analysis and Machine Intelligence , 37(2):307–320, 2015.B. Zhu, A. L. Symonds, J. E. Martin, D. Kioussis, D. C. Wraith, S. Li, and P. Wang.Early growth response gene 2 (egr-2) controls the self-tolerance of t cells and preventsthe development of lupuslike autoimmune disease.