repgenHMM: a dynamic programming tool to infer the rules of immune receptor generation from sequence data
Yuval Elhanati, Quentin Marcou, Thierry Mora, Aleksandra M. Walczak
Laboratoire de physique théorique, UMR8549, CNRS and École normale supérieure, 24, rue Lhomond, 75005 Paris, France.
Laboratoire de physique statistique, UMR8550, CNRS and École normale supérieure, 24, rue Lhomond, 75005 Paris, France.
(Dated: October 10, 2018)

The diversity of the immune repertoire is initially generated by random rearrangements of the receptor gene during early T and B cell development. Rearrangement scenarios are composed of random events – choices of gene templates, base pair deletions and insertions – described by probability distributions. Not all scenarios are equally likely, and the same receptor sequence may be obtained in several different ways. Quantifying the distribution of these rearrangements is an essential baseline for studying the diversity of the immune system. Inferring the properties of the distributions from receptor sequences is a computationally hard problem, requiring the enumeration of every possible scenario for every sampled receptor sequence. We present a Hidden Markov model, which accounts for all plausible scenarios that can generate the receptor sequences. We developed and implemented a method based on the Baum-Welch algorithm that can efficiently infer the parameters for the different events of the rearrangement process. We tested our software tool on sequence data for both the alpha and beta chains of the T cell receptor. To test the validity of our algorithm, we also generated synthetic sequences produced by a known model, and confirmed that its parameters could be accurately inferred back from the sequences. The inferred model can be used to generate synthetic sequences, to calculate the probability of generation of any receptor sequence, as well as the theoretical diversity of the repertoire. We estimate this diversity to be ≈ for human T cells.
The model gives a baseline to investigate the selection and dynamics of immune repertoires. Source code and sample sequence files are available at https://bitbucket.org/yuvalel/repgenhmm/downloads .

I. INTRODUCTION
The ability of the adaptive immune system to identify a wide range of threats rests upon the diversity of its lymphocyte receptors, which together make up the immune repertoire. Each such receptor can bind specifically to antigenic molecules, and initiates an immune response against the threat. T cell receptors (TCR) are composed of two protein chains, called alpha and beta. B cell receptors (BCR) share a very similar structure, with a light chain and a heavy chain playing the same roles. Each chain is produced according to the same process of V(D)J rearrangement. In each new cell and for each of the two chains, two germline segments for alpha chains (Vα and Jα genes), or three segments for beta chains (Vβ, Dβ and Jβ genes), are assembled together to form the recombined gene coding for the chain. In addition, at the junctions where the segments are joined, the ends of the segments are trimmed, and random nucleotides are inserted (see Fig. 1a for a diagram describing the alpha chain rearrangement process). This process creates a large initial diversity of possible receptors, which are later selected according to their recognition functionality. An important property of this process is that it is redundant, as many different V(D)J rearrangements may lead to the exact same sequence. It is thus impossible to unambiguously reconstruct the scenario from the sequence alone, a problem that is aggravated by sequencing errors. The rearrangement process is random, as is each of the elements composing it – choice of germline segments, number of deleted nucleotides, number and identity of insertions.

Since the rearrangement process is the basis of repertoire diversity, it is important to study its distribution quantitatively. With recent advances in high-throughput sequencing, there is a growing body of data on repertoires for both T and B cells, in a variety of situations.
Using large sequence datasets of rearranged, non-productive genes, the probability distribution of rearrangement events in the human TCR beta chain and BCR heavy chain could be inferred using statistical methods, gaining important insights into the random processes underlying repertoire diversity [1, 2]. However, these studies are based on a brute-force approach, which enumerates every possible rearrangement scenario for each observed sequence. This is a very computationally costly procedure, which is unrealistic for very large datasets. In this report we present a dynamic programming approach to learn the distribution of rearrangement scenarios from large numbers of non-productive sequences in an efficient way. This approach is based on a Hidden Markov Model (HMM) formulation of the problem, and learns its parameters using a modified Baum-Welch (BW) algorithm to avoid the full enumeration of all scenarios. Many studies have described algorithms designed to process large numbers of rearranged TCR or BCR genes and extract the template V(D)J genes of the rearrangement [3–17]. These tools process each sequence separately to obtain the best (but often incorrect) alignment to a V(D)J combination, sometimes using dynamic programming or HMMs [13, 14, 16, 17], and assume an implicit, ad hoc prior on rearrangements. By contrast, our algorithm explores all plausible alignments for each sequence in the data to learn accurately the distribution of rearrangement events.

Once the model of rearrangement has been learned by our procedure, the entire distribution of possible sequences and their probabilities is accessible. Our algorithm can calculate the probability of any sampled sequence, even if it is not part of the data used to learn the model, and it can generate arbitrary numbers of synthetic sequences with the exact same statistics as the data.
It can also calculate the entropy of the rearrangement process – a classical measure of sequence diversity. This enables us to further our understanding of the generation process, quantify the baseline state of the immune system, and evaluate subsequent processes such as somatic selection. Finally, our work produces insights not just on the data sequences, but on the underlying biological processes.

II. METHODS

A. Model
The algorithm assumes a general form for the probability distribution of possible rearrangements, and then finds the parameters of that distribution that best fit the sequence data [1, 2]. For simplicity we first describe the model for the alpha chain of TCRs, which also applies to the light chain of BCRs. The case of the beta and heavy chains will be described later.

The model specifies probability distributions for each of the rearrangement events: V and J gene choices P(V, J), number of deletions conditioned on the gene being deleted P(delV | V) and P(delJ | J), and insertion length and nucleotide identity P(ins). Together these distributions form the parameter set θ = {P(V, J), P(delV | V), ...}. The probability of a given rearrangement scenario r = (V, J, delV, delJ, ins) is then given by:

P_rearr(r | θ) = P(V, J) P(delV | V) P(delJ | J) P(ins).   (1)

The specific form of the model assumes some dependencies between the events. In the above formula, for instance, the probabilities of the V and J choices can be dependent, but the insertion is independent of both, while the deletion probabilities depend only on the identity of the gene being deleted. The model choice is based on biological knowledge, and has been validated by previous work in the case of the beta chain [1].

To avoid confounding factors related to thymic or peripheral selection, the inference of the parameters is performed on non-productive genes. During the maturation of cells, some rearrangement events produce non-productive genes that are either out of frame, having the wrong combination of insertions and deletions, or contain a stop codon. When this happens, the other chromosome in the same cell may undergo a second, successful rearrangement event, ensuring the survival of the cell. Yet the non-productive rearrangements remain and are part of the sequence dataset.
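As an illustration, Eq. 1 amounts to multiplying a handful of lookup tables. The following is a minimal Python sketch with toy distributions and an invented scenario; it is not the repgenHMM implementation, and all names and numbers are hypothetical:

```python
# Illustrative sketch of Eq. 1: the probability of one rearrangement
# scenario is the product of the probabilities of its elementary events.
# All distributions below are toy values, not inferred parameters.

def p_rearr(scenario, theta):
    """P_rearr(r|theta) = P(V,J) * P(delV|V) * P(delJ|J) * P(ins)."""
    V, J, delV, delJ, ins = scenario
    return (theta["P_VJ"][(V, J)]
            * theta["P_delV"][V][delV]
            * theta["P_delJ"][J][delJ]
            * theta["P_ins"][ins])

theta = {
    "P_VJ":   {("V1", "J1"): 0.6, ("V1", "J2"): 0.4},
    "P_delV": {"V1": {0: 0.7, 1: 0.3}},
    "P_delJ": {"J1": {0: 0.5, 1: 0.5}, "J2": {0: 1.0}},
    "P_ins":  {0: 0.2, 1: 0.8},
}
print(p_rearr(("V1", "J1", 0, 0, 1), theta))  # 0.6*0.7*0.5*0.8 = 0.168
```

The factorised form is what later makes the maximisation step of EM a simple per-event frequency count.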
Since these sequences have no effect on the cell in which they reside, they have not been selected. Studying their statistics is thus equivalent to studying the generation process itself, with no selection.

While the model specifies the distribution over rearrangement scenarios, the data consist of recombined sequences, denoted by s, which can be the result of different scenarios r. The recombination events are hidden variables, and the likelihood of a sequence is the sum of the probabilities of all scenarios leading to it, P(s) = Σ_{r→s} P_rearr(r). The likelihood of the sequence dataset cannot easily be maximised with respect to the model parameters. To overcome this problem, the Expectation-Maximisation (EM) algorithm can be used to maximise the likelihood in an iterative manner. In each iteration, new model parameters θ' are chosen to increase the likelihood until the maximum is obtained. In the expectation step, the log-likelihood of hidden rearrangements is averaged over their posterior distribution under the current model θ, to form Q(θ' | θ) = Σ_{a=1..n} Σ_r P(r | s^(a), θ) log P_rearr(r | θ'), where the sum on a runs over the n sequences (s^(1), s^(2), ..., s^(n)) in the dataset. The maximisation step consists of maximising Q(θ' | θ) over θ' to obtain the new parameter set. Because of the simple factorised form of Eq. 1, this second step is equivalent to calculating the frequency of each rearrangement event under the posterior P(r | s^(a), θ) = P(r, s^(a) | θ) / P(s^(a) | θ). For example, P(V, J) is updated according to:

P'(V, J) = (1/n) Σ_{a=1..n} Σ_{r : V,J} P(r | s^(a), θ),   (2)

where the sum on r : V, J runs over scenarios with gene choices V and J. Similar update rules are used for the other model parameters P(delV | V), P(delJ | J), and P(ins).
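A toy sketch of the update of Eq. 2, assuming the per-sequence scenario posteriors have already been computed; the data and function name are hypothetical, not part of the tool:

```python
# Toy maximisation step (Eq. 2): P'(V,J) is the posterior frequency of
# each (V,J) choice, averaged over the n sequences of the dataset.
from collections import defaultdict

def update_PVJ(posteriors):
    """posteriors: one dict per sequence, mapping a scenario tuple
    (V, J, ...) to its posterior P(r | s, theta); each dict sums to 1."""
    n = len(posteriors)
    P_VJ = defaultdict(float)
    for post in posteriors:
        for scenario, p in post.items():
            V, J = scenario[0], scenario[1]
            P_VJ[(V, J)] += p / n
    return dict(P_VJ)

# Two sequences with invented scenario posteriors:
posteriors = [
    {("V1", "J1", 0): 0.9, ("V2", "J1", 0): 0.1},
    {("V1", "J1", 1): 0.5, ("V1", "J2", 0): 0.5},
]
print(update_PVJ(posteriors))  # ≈ {('V1','J1'): 0.7, ('V2','J1'): 0.05, ('V1','J2'): 0.25}
```

The costly part is producing the scenario posteriors in the first place; this is exactly what the HMM formulation below makes efficient.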
The sums over possible scenarios, for each data sequence s^(a), are computationally heavy. To make them easier, we now introduce an equivalent HMM formulation of the model.

B. HMM formulation
The almost linear structure of rearrangements allows for their description as a Markov chain. Hidden Markov models lend themselves to the much more efficient forward-backward algorithm for marginal estimations and, in combination with Expectation-Maximisation, to the Baum-Welch (BW) algorithm for parameter inference. In general, however, the V and J gene choices may be correlated in their joint distribution P(V, J), breaking the Markovian nature of the rearrangement statistics. To preserve the Markovian structure, we build a separate HMM for each choice of the pair of germline genes (V, J), and use the forward-backward algorithm to calculate the marginals of the other rearrangement events conditioned on that choice. These conditional marginals are then combined over all (V, J) to perform the maximisation step of EM.

For a chosen (V, J) pair, a Hidden Markov Model (HMM) is constructed to yield the recombined sequences in accordance with Eq. 1. Fig. 1b shows a diagram of the model. The model proceeds through a sequence of hidden states S, which emit nucleotides s_i, for i = 1, ..., L, where L is the sequence length, thus producing the entire sequence s = (s_1, ..., s_L). We distinguish between two types of emitting states, represented by circles in Fig. 1b. First, 'genomic states', or V and J states, are defined for each position on the genomic templates V and J, and are denoted by V_1, V_2, ..., V_end for V states, and likewise for J states. These states emit the nucleotide encoded in the genomic template at the corresponding position, or, with a small error probability p_err, a different nucleotide. These non-templated emissions can be caused by sequencing errors, B cell hypermutations or uncharted alleles. Second, 'insertion states' emit random non-templated nucleotides according to a distribution: I_1 corresponds to the first inserted nucleotide, I_2 to the second one, and so on. In addition, we introduce two 'ghost states' G_1 and G_2, represented by rectangles in Fig. 1b, between the V and I states and between the I and J states. These states do not emit nucleotides, and their sole function is to reduce the number of possible transitions between states by isolating the state types, thus easing computations.

Each sequence is the result of a stochastic path through a series of states, defined in Fig. 1b, and their random emissions. To illustrate how the HMM operates, we follow a possible path leading to the production of a light chain for a given choice of V and J. The chain starts from the V_1 state, going along the V gene and emitting nucleotides from the gene. Most of these nucleotides are those of the genomic template, up to the error rate. At some point (possibly before all V states are exhausted, to account for potential V deletions) the process transitions to the first ghost state G_1.
From G_1 the process goes to the first insertion state I_1, or directly to G_2 if there are no insertions. Each insertion state emits a nucleotide. After a certain number of insertions, the process moves to the second ghost state, G_2, and then on to a J state (but not necessarily J_1, to account for J deletions). Finally the process continues along the J states until J_end, completing the sequence.

The HMM is defined by two sets of parameters, which map directly onto θ \ P(V, J). The first set of parameters is the transition probabilities M_SS' between any two states S and S' connected by an arrow in Fig. 1. The transition rates between V states and G_1, and between G_2 and J states, can be mapped onto the deletion profiles of the V and J genes respectively, and the transition rates between the I states and G_2 can be mapped onto the distribution of the number of insertions.

FIG. 1: (a) Schematic description of the rearrangement process for the light and alpha chains. Random V and J genes are chosen from the genome. A random number of nucleotides are trimmed from their facing ends. These ends are then joined with an insertion segment of variable length and composition. (b) Markov model for this rearrangement process, when the V and J gene choices are known. By progressing along one path following the arrows, the model produces a rearranged receptor gene. Each state denoted by a circle emits a nucleotide. V and J states each emit one nucleotide from the chosen template, up to an error rate. Emissions from the I states are drawn from a specified distribution. The states represented by squares are non-emitting ghost states. The arrows represent the allowed transitions; some of them are marked on the diagram with M_SS'. The probabilities of the transitions and emissions are the parameters of the HMM, as described in the main text.
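To make the sparsity of this architecture concrete, the allowed transitions of Fig. 1b can be listed explicitly for toy gene lengths. This is an illustrative sketch, not the tool's code; state names follow the figure:

```python
# Hedged sketch of the sparse transition structure of Fig. 1b.
# Ghost states G1 and G2 isolate the state types, so each state
# connects to only a handful of successors.

def allowed_transitions(nV, nI, nJ):
    edges = []
    V = [f"V{i}" for i in range(1, nV + 1)]
    I = [f"I{i}" for i in range(1, nI + 1)]
    J = [f"J{i}" for i in range(1, nJ + 1)]
    edges.append(("Start", V[0]))
    for i in range(nV - 1):              # walk along the V gene
        edges.append((V[i], V[i + 1]))
    for v in V:                          # exiting V early = V deletions
        edges.append((v, "G1"))
    edges.append(("G1", "G2"))           # zero insertions
    edges.append(("G1", I[0]))
    for i in range(nI - 1):              # successive insertions
        edges.append((I[i], I[i + 1]))
    for s in I:                          # stopping the insertions
        edges.append((s, "G2"))
    for j in J:                          # entering J late = J deletions
        edges.append(("G2", j))
    for i in range(nJ - 1):              # walk along the J gene
        edges.append((J[i], J[i + 1]))
    return edges

edges = allowed_transitions(nV=4, nI=3, nJ=4)
n_states = 4 + 3 + 4 + 3                 # V, I, J states plus Start/G1/G2
print(len(edges), "edges vs", n_states * n_states, "dense entries")
```

Even at these toy sizes the edge list is an order of magnitude smaller than a dense transition matrix, and the gap grows with gene length.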
The transition matrix M_SS' is very sparse, thanks in part to the ghost states, allowing for quick computations as we will see below. The second set of parameters are the emission probabilities E_S(s) that nucleotide s is emitted by state S. If S is a genomic (V or J) state, then E_S(s) = 1 − p_err if s is the template nucleotide, which we denote by σ_S, and p_err/3 otherwise. If S is an insertion state, the emission probability is given by a distribution E_I(s), which we assume to be common to all insertion states, i.e. independent of the order of insertions.

C. A modified Baum-Welch algorithm
The Baum-Welch (BW) algorithm finds the parameters of a given HMM which maximise the likelihood of producing the observed sequences [18, 19]. It is an instance of the EM algorithm, where the maximisation step is performed using the forward-backward algorithm. Since our HMM is conditioned on the knowledge of the (V, J) pair, which is itself a hidden variable, BW cannot be applied without modification. However, we can still use the forward-backward algorithm to calculate the posterior probabilities of rearrangement events for a given sequence s = (s_1, ..., s_L) and a given putative (V, J) choice, and combine these probabilities at the end.

The forward pass of the forward-backward algorithm calculates α_i(S), the joint probability of the model being in a specific state S and emitting the sequence up to the i-th nucleotide, (s_1, ..., s_i). The backward pass does the same for β_i(S), the conditional probability of emitting the sequence downstream of position i, given that the state at this position is S:

α_i(S) := P(s_1, ..., s_i, S | V, J),   (3)
β_i(S) := P(s_{i+1}, ..., s_L | S, V, J).   (4)

These probabilities are calculated using the following recursion relations:

α_i(S) = E_S(s_i) Σ_{S'} M_SS' α_{i−1}(S'),   (5)
β_i(S) = Σ_{S'} E_S'(s_{i+1}) M_S'S β_{i+1}(S').   (6)

Since our transition matrix M_SS' is very sparse, the sum over S' has few terms and can be calculated efficiently. Having obtained these forward and backward probabilities for a sequence given a choice of V and J genes, the posterior marginal probabilities for each transition (S → S'), as well as the posterior emission probabilities, are calculated as

P(S → S' | V, J, s) ∝ Σ_i α_i(S) M_S'S E_S'(s_{i+1}) β_{i+1}(S'),
P(err | V, J, s) ∝ Σ_i Σ_{S ∈ V,J} α_i(S) β_i(S) (1 − δ_{σ_S, s_i}),
P_ins(s | V, J, s) ∝ Σ_i Σ_{S ∈ I} α_i(S) β_i(S) δ_{s, s_i},   (7)

up to a normalisation constant, where δ denotes Kronecker's delta.

The existence of ghost states requires making a small adjustment to this scheme. Each ghost state introduces an offset between the state index and the corresponding position on the sequence.
Thus, V states at index n correspond to position n in the sequence, I states to position n − 1, and J states to position n − 2. Having calculated these posteriors for each (V, J) pair, we can combine them to obtain the update equations of our modified Baum-Welch algorithm:

M_S'S ← (1/n) Σ_{a=1..n} Σ_{V,J} P(V, J | s^(a)) P(S → S' | V, J, s^(a)),
p_err ← (1/n) Σ_{a=1..n} Σ_{V,J} P(V, J | s^(a)) P(err | V, J, s^(a)),   (8)
E_S(s) ← (1/n) Σ_{a=1..n} Σ_{V,J} P(V, J | s^(a)) P_ins(s | V, J, s^(a)),

where the posterior probability P(V, J | s^(a)) is calculated using Bayes' rule, P(V, J | s^(a)) ∝ P(s^(a) | V, J) P(V, J), with P(s^(a) | V, J) = Σ_S α_L(S).

Finally, the algorithm outputs the probability that any sequence s was generated, as P(s) = Σ_{V,J} P(s | V, J) P(V, J). This probability can be used to calculate the log-likelihood of the model, Σ_{a=1..n} log P(s^(a)), and track the progress of the BW algorithm at each iteration.

D. Pre-Alignments
For a given sequence, there may be many potential candidates for the segments (V, J), but not all are equally plausible, especially when sequence reads are long, and not all should be considered. Before starting the inference procedure, the sequences are locally aligned against all genomic templates using the Smith-Waterman algorithm. By creating a shortlist of genomic genes with an alignment score larger than a tunable threshold, the inference procedure can exclude certain gene choices a priori. This saves considerable computation time by omitting rearrangement scenarios that would have a negligible effect due to high numbers of errors.

In addition, this pre-alignment procedure provides us with a mapping between the positions along the sequence read and each genomic gene. Thus, each of the V and J states of the HMM may only be present at a single position along the sequence, drastically limiting the number of states that we need to consider at each position and improving the speed of computations.
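Restricted to a shortlisted (V, J) pair, the forward recursion of Eq. 5 then only sums over each state's few allowed predecessors. A toy two-state Python sketch, purely illustrative (the real state space is the chain of Fig. 1b, and the model is hypothetical):

```python
# Toy forward pass (Eq. 5) on a sparse HMM:
# alpha[i][S] = E_S(s_i) * sum over predecessors S' of M[S][S'] * alpha[i-1][S'].

def forward(seq, states, M, E, start):
    """M[S] lists (S_prev, prob); E[S][nt] is the emission probability;
    start[S] is the initial state distribution."""
    alpha = [{S: start.get(S, 0.0) * E[S][seq[0]] for S in states}]
    for nt in seq[1:]:
        prev = alpha[-1]
        alpha.append({
            S: E[S][nt] * sum(p * prev[Sp] for Sp, p in M.get(S, []))
            for S in states
        })
    return alpha

# Two-state toy model: a "V" state that prefers 'A', an "I" state uniform.
states = ["V", "I"]
M = {"V": [("V", 0.8)], "I": [("V", 0.2), ("I", 1.0)]}
E = {"V": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
     "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
start = {"V": 1.0}
alpha = forward("AAC", states, M, E, start)
print(sum(alpha[-1].values()))  # likelihood of "AAC" under the toy model
```

Because each state lists only its predecessors, the cost per position is proportional to the number of edges, not to the square of the number of states.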
E. Beta and heavy chains
For the beta chain of the TCR (or the heavy chain of the BCR), the model is similar to the one in Eq. 1, with the addition of a D gene choice, its deletions from both the left and right sides (delDl, delDr), and two independent insertion events at the VD and DJ junctions (insVD, insDJ):

P_rearr(r | θ) = P(V, D, J) P(delV | V) P(insVD) P(delDl, delDr | D) P(insDJ) P(delJ | J).   (9)

FIG. 2: Subdiagram of the Markov model for the beta and heavy chains, focusing on the D gene. Each row corresponds to a different pattern of deletions (delDl, delDr) for the left and right ends of the D segment. State D_d^(delDl, delDr) corresponds to the d-th base in the D gene, when delDl bases are deleted from the left and delDr from the right. Each row is entered from the preceding ghost state with probability P_delD(l, r) = P(delDl = l, delDr = r), and then proceeds deterministically until the next ghost state.

A similar HMM as for the alpha chain can be built by adding genomic D states and having two types of insertion states, for the VD and DJ junctions, and extra ghost states to separate them. Then, the procedure described above can be applied mutatis mutandis. In addition to the V and J genes, the D gene has to be chosen: an HMM is built for each triplet of genomic segments (V, D, J). D genes are short and deleted on both sides. For this reason, they are much harder to align. Even the position of a candidate genomic D segment along the sequence is no longer unambiguous, as it is for V and J segments. For this reason, we do not pre-align the D genes to the sequence. Instead, for each sequence all D genes with all possible locations are considered. Technically this means that the sequence positions at which D states may occur are not pre-specified.

During the rearrangement process, D genes are deleted from both sides. The numbers of bases truncated from the left and right ends of the gene are correlated – in the extreme case, the sum of both deletions cannot exceed the length of the gene. Since the left and right D deletions correspond to transitions from non-adjacent states, this correlation cannot be described using a Markov model. We have addressed this issue by enumerating all the possible deletions from the left and the right as separate states.
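A minimal sketch of this enumeration, assuming a toy D gene (in practice the number of deletions is also bounded by a user-set maximum):

```python
# Hedged sketch: enumerate every (left, right) D-deletion profile as its
# own deterministic sub-chain, as in Fig. 2. The gene below is invented.

def d_subchains(d_gene, max_del=None):
    L = len(d_gene)
    max_del = L if max_del is None else max_del
    chains = {}
    for dl in range(min(max_del, L) + 1):          # left deletions
        for dr in range(min(max_del, L - dl) + 1): # right deletions
            chains[(dl, dr)] = d_gene[dl:L - dr]   # surviving D bases
    return chains

chains = d_subchains("GGGACT")
print(chains[(1, 2)])   # remaining bases after 1 left + 2 right deletions
print(len(chains))      # number of distinct deletion profiles
```

The constraint delDl + delDr ≤ len(D) is enforced by construction, which is exactly the correlation a plain Markov chain could not express.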
In practice, we define different types of D states for each choice of deletions, as depicted in Fig. 2. Each D deletion profile (delDl, delDr) defines a separate sub-chain of the Markov chain going along the D gene, which is entered from the previous ghost state with probability P(delDl, delDr | D).

F. Entropy estimates
The inferred model can be used to characterise the diversity of the distribution of all possible receptors, and not just of the sampled sequences used to infer that distribution. We quantify the diversity of a stochastic quantity X using the Shannon entropy H(X) = −Σ_X P(X) log_2 P(X) (measured in bits). For instance, H(s) gives a measure of sequence diversity. Since a uniform distribution with 2^H outcomes has entropy H, the number 2^H can, even for non-uniform distributions, be interpreted as an effective diversity number, sometimes called true diversity. The entropy of rearrangements can be calculated explicitly thanks to the factorised form of the distribution. For example, for alpha chains:

H(r) = H(V, J) + Σ_V P(V) H(delV | V) + Σ_J P(J) H(delJ | J) + H(ins).   (10)

This expression clearly separates the contributions from (V, J) segment choice, deletions and insertions. The entropy of sequences H(s) cannot be calculated in this way, since receptor sequences can be produced by multiple rearrangements, but it can easily be estimated by averaging −log_2 P(s), where P(s) is calculated by the forward-backward algorithm as explained above.

III. RESULTS

A. Implementation
The method was implemented in C++ (std11) as a command line tool. OpenMP was used for parallelisation. The main pipeline has two parts: the alignment module and the inference module. The input for the entire pipeline is a FASTA or plain text file with unique recombined and non-productive nucleotide sequences, and FASTA files with the genomic templates for the different V and J germline segments (as well as D in the case of heavy or beta chains). Genomic files can be obtained from genomic databases such as IMGT [20]. The alignment code was written to read IMGT FASTA files and extract gene names.

Between the alignment and inference procedures, alignments below a certain threshold are discarded to improve performance. The threshold can be tuned for different data sets. Sequences that do not align well to at least one known gene of each type are completely eliminated as a quality control. Setting the thresholds should be done carefully, keeping a large majority of the sequences with at least one good alignment, but excluding those with only low-score alignments. To this end, an auxiliary module is included that outputs statistics on the best alignment score for each sequence. Curated alignment files are saved at the end of the alignment stage, and used as input for the inference. Apart from the alignments, the only parameters needed for inference are the maximum numbers of insertions and deletions.

The output of the main pipeline is the value of the model parameters: the joint distribution of segment usage; the distributions of deletions at each of the deletion sites, conditioned on the gene; the distribution of the number and composition of inserted nucleotides at the junctions; and a global error rate accounting for sequencing errors, hypermutations or genomic variants.

Two more modules are included. First, given a list of sequences and an inferred model, the software can compute generation probabilities for all sequences.
Second, a synthetic sequence generation module can produce sequences from a given model. This module can be used to study the properties of the distribution, or to verify the inference algorithm using a known model, as we will see below.
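As a worked illustration of the entropy decomposition of Eq. 10, the rearrangement entropy can be computed directly from the model's event distributions. The values below are toy inputs, not inferred parameters:

```python
# Hedged illustration of Eq. 10 with toy distributions.
# H(r) = H(V,J) + sum_V P(V) H(delV|V) + sum_J P(J) H(delJ|J) + H(ins).
from collections import defaultdict
from math import log2

def H(probs):
    """Shannon entropy, in bits, of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def rearrangement_entropy(P_VJ, P_delV, P_delJ, P_ins):
    P_V, P_J = defaultdict(float), defaultdict(float)
    for (V, J), p in P_VJ.items():   # marginalise the joint gene choice
        P_V[V] += p
        P_J[J] += p
    return (H(P_VJ.values())
            + sum(P_V[V] * H(P_delV[V].values()) for V in P_V)
            + sum(P_J[J] * H(P_delJ[J].values()) for J in P_J)
            + H(P_ins.values()))

P_VJ = {("V1", "J1"): 0.5, ("V2", "J1"): 0.25, ("V2", "J2"): 0.25}
P_delV = {"V1": {0: 1.0}, "V2": {0: 0.5, 1: 0.5}}
P_delJ = {"J1": {0: 1.0}, "J2": {0: 0.5, 1: 0.5}}
P_ins = {0: 0.25, 1: 0.75}
print(rearrangement_entropy(P_VJ, P_delV, P_delJ, P_ins))  # ≈ 3.06 bits
```

Note that the deletion terms are weighted by the marginal gene usage, so rarely used genes contribute little to the total even if their deletion profiles are broad.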
B. Application to alpha chain data
The algorithm was applied to human TCR alpha chain sequences from [21]. The data consist of around 80,000 non-productive sequences, each 151 bp long, covering both the V and J segments. Sequences were aligned to lists of genomic sequences from the IMGT online database [20], and then given as input to the inference algorithm. The model converged rapidly, as can be seen from the quick saturation of the likelihood (Fig. 3a).

The entropy of the rearrangement process quantifies the diversity of possible scenarios. It was calculated using Eq. 10 and found to be 32 bits (Fig. 3b, top line). This entropy can be partitioned into contributions from each of the rearrangement events – segment choice, insertions and deletions (bottom line): VJ choice, 11 bits; insertion length, 3.9 bits; insertion nucleotides, 11 bits; V deletions, 2.8 bits; J deletions, 3.1 bits. The largest contribution to the entropy comes from the insertions, followed closely by gene choice. We also estimated the entropy of the sequence distribution (middle line), which is smaller than the rearrangement entropy because of convergent rearrangements – multiple rearrangements leading to the same sequence (grey box). This estimate was based on samples of simulated sequences. Undersampling bias and error were corrected for by using samples of different sizes and validating the convergence of the entropy. The entropy of the alpha chain sequences is 30 bits, which corresponds to a diversity number of about 10^9.

Inferred insertion and deletion profiles for the alpha rearrangements are shown in Figs. 3c and 3d, with the deletion profile averaged over genes. The peak of the insertion distribution is at 5 bp, similar to previous results for the beta chain. The joint distribution of gene usage for V and J, represented in Fig. 3e, shows a wide variety of gene usage probabilities, with clear dependencies on the ordering of genes according to their location on the chromosome [22]. To better see these dependencies, in Fig. 3f we plotted P(V, J) − P(V) P(J) as a measure of the correlation between V and J choices.

FIG. 3: TCR alpha chain rearrangement distribution inferred from sequence data taken from [21]. (a) The log-likelihood of the data given the model saturates as a function of the number of iterations of the Expectation-Maximisation algorithm. (b) Shannon entropy of rearrangements (top row) and sequences (middle row). The sequence entropy is lower than the total recombination entropy because of convergent rearrangements. The rearrangement entropy is the sum of the entropies of its elementary events (bottom row). (c) Distribution of the number of inserted nucleotides (red curve). For comparison, the same distribution obtained by taking the best alignment of each sequence to the genomic templates is represented by the black dashed line. (d) Distributions of the number of deletions for both V and J genes. These gene-dependent distributions are here averaged over genes. (e) Joint distribution of V and J usage, P(V, J). Genes are ordered by position along the genome. (f) The covariance P(V, J) − P(V) P(J) clearly shows strong correlations for genes that are either close to the separation between the V and J segments, or far from it.

FIG. 4: Performance of the algorithm on synthetic data. Sequences generated using a known model were given as input to the inference algorithm. The results of the inference are compared to the true model used for generation, for (a) the distribution of the number of insertions (inset: usage of inserted nucleotides), and (b) V, J gene usage. The error bars, which correspond to sample noise, are smaller than the symbol size for (a).

In [23], it was proposed that rearrangements can occur in several steps. When a V and a J segment are joined, the genomic segments that were between them are excised, but the segments located on the outer flanks remain. Then, successive rearrangements joining these outer segments might occur. It was hypothesised that early joining events involve V and J genes that are close to each other, hence proximal to the boundary between the V and J cassettes. Later joining events, on the other hand, should involve more distal genes, as the proximal genes are likely to have been excised. This phenomenon is expected to lead to correlations between pairs of genes which are either both distal or both proximal, which is consistent with the results of Fig. 3f. Notice also that in the intermediate range our model predicts low correlation within a certain window of distances between the V and J genes.

C. Test on synthetic data
In order to check the validity of the algorithm, we ran iton sequences that were produced according to a knownmodel. We generated 100,000 synthetic sequences ac-cording to the model learned in the previous section,and relearned a model from these sequences using ouralgorithm. In Fig. 4 we compare the parameters of themodel used for generation to those of the inferred model.Sampling was repeated 5 times to estimate sample noise,which was found to be very small for all paramaters, ex-cept for gene usage (error bars in Fig. 4b).The insertion bias, i.e. the usage of the different nu-cleotides in inserted segments, is inferred almost per-fectly, as seen in Fig. 4a. The probabilities for each V,Jchoice also show excellent agreement, within sampling number of deletions averaged over genes0 5 10 15 20 25 p r obab ili t y V deletionsJ deletions (a) (b) number of insertions0 5 10 15 20 25 p r obab ili t y VD insertionsDJ insertionsalpha insertions
FIG. 5: TCR beta chain rearrangement distribution inferred from sequence data previously analysed in [1]. (a) Distribution of the number of insertions at both VD and DJ junctions, and comparison with the distribution of insertions in the alpha chain from Fig. 3c. Inset: the nucleotide usage is identical for VD and DJ insertions when considered on opposite strands. (b) Distribution of the number of deletions on both the V and J genes, averaged over different genes.

errors (Fig. 4b). The distribution of the number of insertions also agrees very well (Fig. 4a). However, when inferring the same model, but with a higher error rate (1%), we found that the inferred insertion distribution noticeably overestimated the probability of zero insertions.
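The generate-and-reinfer check described above can be sketched in miniature. In the toy below, the full rearrangement model is replaced by a single insertion-length distribution, and plain frequency counting stands in for the Baum-Welch inference; all parameter values are invented:

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical minimal "generation model": a single insertion-length
# distribution P(n insertions) for n = 0..4 (values are made up)
true_p_ins = [0.35, 0.30, 0.20, 0.10, 0.05]

# Step 1: generate synthetic sequences from the known model
n_seqs = 100_000
samples = rng.choices(range(5), weights=true_p_ins, k=n_seqs)

# Step 2: re-infer the parameters from the synthetic data; here simple
# maximum-likelihood counting plays the role of Baum-Welch
counts = Counter(samples)
inferred_p_ins = [counts[n] / n_seqs for n in range(5)]

# Step 3: the inferred parameters should match the truth up to sampling
# noise, which scales as 1/sqrt(n_seqs)
assert max(abs(t - i) for t, i in zip(true_p_ins, inferred_p_ins)) < 0.01
```

Repeating steps 1–3 with independent random seeds gives the spread used as error bars in Fig. 4.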
D. Application to beta chain data
The beta and heavy chain version of the algorithm was applied to the human TCR beta data that was already analysed in [1] using brute-force methods. The inferred model parameters were all very similar to those reported in [1], confirming the validity of our algorithm. The distributions of the number of insertions and deletions are displayed in Fig. 5. Remarkably, insertion profiles at the two insertion sites are very close to each other, as previously reported, but they also closely match the insertion profile of the alpha chain (Fig. 5a). Nucleotide usage in each of the two inserted regions (between V and D, and between D and J) is shown in the inset. The VD insertion base usage is similar to the usage of the complementary bases (antisense) in the DJ region, suggesting that the biological mechanism operates on opposite strands for the two insertion types, as previously noted [1]. From the computational point of view, because the algorithm enumerates both the D gene choice and its deletions, its benefit is smaller for beta chains than for alpha chains.
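The sense/antisense relation between the two insertion profiles can be made concrete with a small sketch. The usage values below are invented; only the complement relation matters:

```python
# Hypothetical nucleotide-usage profiles of inserted bases, both read on
# the sense strand (values invented for illustration)
vd_usage = {"A": 0.15, "C": 0.30, "G": 0.35, "T": 0.20}
dj_usage = {"A": 0.20, "C": 0.35, "G": 0.30, "T": 0.15}

complement = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Re-read the DJ profile on the opposite strand: a base b inserted on the
# antisense strand appears as complement(b) on the sense strand
dj_antisense = {complement[b]: p for b, p in dj_usage.items()}

# The observation in the text: VD usage (sense) matches DJ usage read
# antisense, consistent with one insertion mechanism acting on opposite
# strands at the two junctions
assert vd_usage == dj_antisense
```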
IV. DISCUSSION
The strength of the adaptive immune system lies in its inherent diversity. This dynamic diversity is the result of several stochastic biological mechanisms, including complex enzymatic reactions that alter the DNA structure and composition. The action of some of the enzymes, such as RAG and TdT, has been studied extensively (see [24], [25] and [26] for reviews). In our work we have studied the way in which the diversity is generated in a top-down approach, focusing on statistical properties as inferred from sequenced receptor genes. In principle, this is a computationally difficult problem, due to the very large number of possible rearrangement scenarios.

The algorithm described here can be used to study the properties of generation of receptor chains of B and T cells in any species, from large sequence data sets. Our dynamic programming approach, which is a variant of the Baum-Welch algorithm, takes advantage of the linear structure of rearrangements to avoid a full enumeration of scenarios. In a brute-force approach such as the one presented in [1] and [2], the algorithmic complexity scales as the product of the numbers of each independent rearrangement event. By contrast, the complexity of the method we presented here is additive with the number of insertions and deletions.

Despite technological advances, sequencing techniques still introduce errors. In addition, allelic variants and hypermutations in B cells introduce additional discrepancies between known template genes and sequence reads. Our method can be used with models that account for such events. In the presented version of the algorithm, substitution errors are already fully accounted for. Insertion and deletion errors could likewise be handled by adding transitions that skip or repeat bases.

We find many common features of generation for the alpha and beta chains of the TCR. There is a difference of diversity, due to the greater length and complexity of the beta chain compared to the alpha chain.
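The multiplicative versus additive scaling can be illustrated with back-of-the-envelope numbers. The event ranges below are invented, not the actual gene counts of the TCR loci:

```python
# Illustrative scaling comparison; all event ranges are hypothetical
n_V, n_J = 50, 60        # number of V and J gene choices
n_delV, n_delJ = 20, 20  # possible deletion numbers at each gene end
n_ins = 25               # possible numbers of inserted nucleotides

# Brute force: every combination of the independent events is a distinct
# scenario, so the count is a product over event types
scenarios = n_V * n_delV * n_ins * n_delJ * n_J

# Dynamic programming: a forward/backward pass treats each event layer
# once, so the state count grows additively with the event ranges
dp_states = n_V + n_delV + n_ins + n_delJ + n_J

print(scenarios, dp_states)  # 30000000 175
```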
The diversity number of the beta chain repertoire, estimated to be approximately 10 in [1], is therefore much larger than that of the alpha chain reported here, 10. Assuming that the rearrangements of the alpha and beta chains are independent, the total TCR diversity is about 10.

Inferring statistical properties of the underlying biological processes can be thought of as a diagnostic tool for the properties of the immune system, and could be used in a variety of clinical settings. The generation process and subsequent selection shape the initial diversity of the immune system, and we have found this process to be remarkably universal across normal, healthy humans, except for slight variations in gene usage [1, 2]. However, infections or irregularities in the immune system can be seen as perturbations that will change these distributions. By comparing the normal distributions to different pathological situations, information on the reaction of the immune system can be extracted. This information could in turn be used for diagnosis, using fast computational tools such as the one presented here.

With more and more immune repertoire sequence data being collected, efficient algorithms are needed. The ability to quickly infer and analyse large data sets is essential both for our basic understanding of the adaptive immune system and also for specific clinical applications.

Acknowledgements
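The independence argument amounts to entropies adding, and hence diversity numbers multiplying. A minimal sketch with placeholder entropy values (the actual alpha- and beta-chain estimates are those quoted in the text, not the numbers below):

```python
from math import log10

# Placeholder recombination entropies in bits, for illustration only
S_alpha = 30.0  # hypothetical alpha-chain entropy
S_beta = 44.0   # hypothetical beta-chain entropy

# Diversity numbers are exponentials of the entropy, so independence of
# the two rearrangement processes makes the total diversity a product
div_alpha = 2.0 ** S_alpha
div_beta = 2.0 ** S_beta
div_total = 2.0 ** (S_alpha + S_beta)

assert div_total == div_alpha * div_beta
print(f"total diversity ~ 10^{log10(div_total):.0f}")
```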
We thank members of the Chudakov and Lebedev groups at the Institute of Bioorganic Chemistry of the Russian Academy of Sciences, and in particular M. Pogorelyy, for sharing their TCR alpha data with us. This work was supported by European Research Council Starting Grant n. 306312.

[1] Murugan A, Mora T, Walczak AM, Callan CG (2012) Statistical inference of the generation probability of T-cell receptors from sequence repertoires.
Proceedings of the National Academy of Sciences of the United States of America
Philosophical transactions of the Royal Society of London. Series B, Biological sciences
PLoS One
Research in Computational Molecular Biology SE-7, Lecture Notes in Computer Science, ed Przytycka TM (Springer International Publishing) Vol. 9029, pp 44–59. [5] Russ DE, Ho KY, Longo NS (2015) HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs.
BMC Bioinformatics
BMC Bioinformatics
Nucleic acids research
Nucleic acids research
Journal of immunology (Baltimore, Md.: 1950)
Immunology
Proceedings of the National Academy of Sciences of the United States of America
Philosophical transactions of the Royal Society of London. Series B, Biological sciences
Bioinformatics
Bioinformatics
BMC bioinformatics
Bioinformatics (Oxford, England)
Biological sequence analysis: probabilistic models of proteins and nucleic acids (Cambridge University Press). [19] Bishop CM (2006)
Pattern recognition and machine learning (Springer). [20] Giudicelli V, Chaume D, Lefranc MP (2005) IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes.
Nucleic acids research
Proceedings of the National Academy of Sciences of the United States of America
The T Cell Receptor FactsBook, Factsbook (Elsevier Science). [23] Warmflash A, Dinner AR (2006) A Model for TCR Gene Segment Use.
The Journal of Immunology
Annual review of genetics
Cell