Learning the heterogeneous hypermutation landscape of immunoglobulins from high-throughput repertoire data
LLearning the heterogeneous hypermutation landscape of immunoglobulins fromhigh-throughput repertoire data
Natanael Spisak, Aleksandra M. Walczak, ∗ and Thierry Mora ∗ Laboratoire de physique de l’ ´Ecole normale sup´erieure, CNRS, PSL University,Sorbonne Universit´e, and Universit´e de Paris, 75005 Paris, France
Somatic hypermutations of immunoglobulin (Ig) genes occurring during affinity maturation driveB-cell receptors’ ability to evolve strong binding to their antigenic targets. The landscape of thesemutations is highly heterogeneous, with certain regions of the Ig gene being preferentially targeted.However, a rigorous quantification of this bias has been difficult because of phylogenetic correlationsbetween sequences and the interference of selective forces. Here, we present an approach that correctsfor these issues, and use it to learn a model of hypermutation preferences from a recently publishedlarge IgH repertoire dataset. The obtained model predicts mutation profiles accurately and in areproducible way, including in the previously uncharacterized Complementarity Determining Region3, revealing that both the sequence context of the mutation and its absolute position along the geneare important. In addition, we show that hypermutations occurring concomittantly along B-celllineages tend to co-localize, suggesting a possible mechanism for accelerating affinity maturation.
I. INTRODUCTION
B cells are a crucial player in the adaptive immunesystem. Swift eradication of pathogens is enabled by theproduction of immunoglobulins (Ig) that bind tightly toantigens, helping in their detection, neutralization, andremoval. Achieving high accuracy and breadth relies onthe extraordinary diversity of the B cells repertoire. Theprocess of V(D)J recombination results in a highly di-verse population of naive cells [1–7]. In addition, B cellsundergo affinity maturation, a Darwinian process [8] inwhich mutations are introduced to the immunoglobulin-coding gene and highest affinity mutants are selected [9].This process is driven by a very high rate of somatichypermutations (SHM), ∼ − per basepair per cell di-vision [10], targeting the Ig genes. Some receptor genescan ultimately accumulate up to 30% amino acid substi-tutions, considerably altering the initial genotype. Thebroad diversity created by SHM ultimately ensures theemergence and selection of strong antigen binders. Un-derstanding SHM and their statistics is key to designingbetter vaccination strategies [11, 12].Like the VDJ recombination process, SHM are charac-terized by heterogeneous preferences. Mutational path-ways affect the Ig genes unevenly, with ‘cold’ and ‘hot’spots along the receptor gene, even before somatic se-lection introduces further biases [12]. SHM is initi-ated by Activation-Induced cytidine Deaminase (AID)through the deamination of deoxycytidines triggering anarray of error-prone repair pathways [13]. AID and re-pair enzymes preferentially target certain regions of thegene. However, a quantitative picture of how these pro-cesses and their context dependencies result in the ob-served heterogeneous mutational landscape is lacking.High-throughput repertoire sequencing of the Ig gene ∗ Corresponding authors. These authors contributed equally. [2, 3, 14, 15] has facilitated the development of effectivemodels from a detailed analysis of mutational profiles ofIg sequences before [5, 16, 17] or after selection [18–22].However, the spatial organization of mutations, their con-text preferences, and their interplay with selection duringaffinity maturation are still poorly understood, in partdue to a number of confounding factors.A fundamental issue is the bias of selection, whichfavors beneficial mutations over deleterious ones in theobserved repertoire. This bias can be partially circum-vented by analyzing synonymous substitutions [16], withthe limitation that extrapolation is required to generalizeto non-synonymous ones. Another way around selectionis to study passenger nonproductive sequences, which areunsuccessful products of VDJ recombination and thusunaffected by selection [5, 17, 22]. These sequences makeup a minority of DNA sequences, and are rarely foundin mRNA sequences because of allelic exclusion, whichlimits their use to very large datasets.Another confounding factor arises from phylogeneticbiases due to the complex multi-lineage structure of therepertoire. While methods have been developed to infersubstitution rates from lineages in a lineage-specific [21]or repertoire-wide way [23], they do not aim to correctfor selection and do not address the question of hyper-mutation targeting.Here we propose a new framework for quantifyingand predicting immunoglobulin mutability. The modelis trained on the reconstructed phylogenies of nonpro-ductive lineages from very large published B cell reper-toires totalling around half a million nonproductive se-quences [7], allowing us to overcome previous limitationsof dataset sizes. The approach accounts for both phyloge-netic and selection biases, and allows us to study in detailthe spatial and context preferences of hypermutation tar-geting, and to reveal the co-localization of contemporarymutations. a r X i v : . [ q - b i o . GN ] S e p II. RESULTSRepertoire-wide framework to model intrinsicmutabilities from out-of-frame lineages
Out-of-frame Ig sequences are byproducts of the VDJrecombination process that are made non functional bya frameshift in the CDR3 region. Since each cell has twocopies of the Ig genes, out-of-frame rearrangements maysurvive in the cell if recombination on the second chro-mosome is successful. The mechanism of allelic exclusionensures that only the functional variant is expressed. Yet,out-of-frame IgH sequences comprise ∼
2% of rearrange-ments in Ig mRNA sequencing experiments, and ∼
9% ingenomic DNA [6]. When a B-cell clone harboring bothan out-of-frame and a functional rearrangement under-goes affinity maturation, the out-of-frame sequence actsas a passenger and mutates alongside the functional se-quence, with the selection pressure acting only on thelatter. While the two sequences share the same phy-logeny, mutations found in out-of-frame lineages are notexpected to be subject to selection.To model the process of SHM, we reconstruct the evo-lutionary history giving rise to the observed mutationpatterns in nonproductive rearrangements. We analyseddata consisting of the IgG repertoires of 9 individualsfrom Ref. [7], obtained by the targeted mRNA sequencingof the Ig heavy (IgH) chain locus. We pre-processed andaligned raw IgH sequences to keep only out-of-frame se-quences. We then grouped sequences into clonal familiesthat originate from the same ancestor using single linkageclustering (Fig. 1A). The size of clonal families typicallyfollows a power-law distribution (Fig. 1C). As a result,many lineages are represented by one or a few sequences.We focused on sufficiently large lineages (comprised of atleast 6 distinct sequences) and reconstructed their lineagestructure, using maximum likelihood [24, 25] to infer thetopology of the underlying tree, and marginal reconstruc-tion for the identity of ancestral states. This provides uswith a list > ,
000 mutation events occurring betweenthe most recent common ancestor of lineages and theirleaves.Using lineage information is essential for multiple rea-sons. First, it allows for a better estimate of the sequencecontext in which a mutation appears. In this paper wedefine the context as the 5-mer sequence comprising themutated basepair flanked by 2 basepairs on each side.In the absence of lineage information, the best guess forthe 5-mer context would be given by the genomic se-quence of the V, D, or J segment where the mutationarose. But that context may itself be affected by otherprior mutations. The tree structure allows us to identifythe order of mutations and reconstruct the probable 5-mer context in which each mutation occurred. Second,for the same reason, the tree structure can help identifymutations in the hyper-variable CDR3 region, includingin the junctions made of nontemplated insertions. Thismakes it possible to estimate the hypermutation rate in these regions. Together, these improvements mean thatmutations can be identified within a broader range of 5-mer contexts, and their corresponding mutabilities betterestimated. Third, lineage structure helps reduce contam-ination from sequences that have been under some selec-tion. In some rare events, during affinity maturation asomatic insertion or deletion may be introduced in theCDR3 of a productive sequence, which would lead us toclassify it as out-of-frame, even though it has been sub-ject to selection prior to the frame-shift event. Focusingon mutations happening downstream of the most recentcommon ancestor, which is already out of frame, help usdiscards those contaminating events.Given a model P ( s → s (cid:48) | t, θ ) of sequence evolutionfrom s to s (cid:48) , where t is fraction of mutated positionsbetween s and s (cid:48) , (called branch length, equal to thenumber of mutations divided by alignment length), and θ denotes model parameters, we can write the joint like-lihood of mutation events in each lineage as P ( S | T, θ ) = (cid:89) ( i,j ) ∈ T P ( s i → s j | t j , θ ) , (1)where S is the set of sequences (observed and recon-structed) at each node of the tree, and T encodes the re-constructed phylogenetic tree through its branches ( i, j ).We assume every position x of the sequence s evolvesindependently inside each branch. Mutations occur ac-cording to a set of Poisson clocks with sequence- andposition-dependent rates, µ s,x , expressed per unit timeof branch length. During t some positions will mutateand others will remain unchanged, so that P ( s → s (cid:48) | t ) = (cid:89) x | s ( x ) (cid:54) = s (cid:48) ( x ) (1 − e − µ s,x t ) (cid:89) x | s ( x )= s (cid:48) ( x ) e − µ s,x t . (2)We assume that mutability depends independently on thelocal 5-mer sequence context centered around the muta-tion, w ( s, x ) = ( s ( x − , . . . , s ( x + 2)), and on the abso-lute position x along the gene (measured as the distancefrom the 5 (cid:48) end of the gene), so that µ s,x = γ w ( s,x ) β x . Inabsence of context and position dependence, we wouldhave µ = 1 by construction. Thus values of γ w or β x above 1 imply higher mutabilities than average, and viceversa for values below 1. To lift the degeneracy in overallscale between β x and γ w , we impose (cid:104) β x (cid:105) = 1.Overall, the model has 4 = 1024 parameters for γ w corresponding to each 5-mer, and L = 400 parametersfor β x corresponding to each possible position. We inferthese parameters from repertoire-wide sequencing databy maximizing the total log-likelihood of mutations in allbranches in all lineages, L ( β, γ ) = (cid:80) ( S,T ) ln P ( S | T, β, γ ),with respect to ( β, γ ), using an iterative procedure.
Validation on synthetic data
We first tested the ability of the inference frameworkto recover true mutability parameters using synthetic Clonal family size10 D i s tr i bu t i o n . . . . . . . D i s tr i bu t i o n Donors316188326650326651326713326737326780326797326907327059
A B C DE
Pre-MRCAPost-MRCAMutations
ACGTGACCGTAGTCTTTACTGTGACACGTGAACGTAGTCTTTACTGTGAC w ( s, x )
100 200 300 400Position x I n f e rr e d m u t a b ili t i e s Recovered = 1True profile 100 200 300 400Position x Recovered = x Recovered = x True profiles
A B C
CDR3 CDR3
FIG. 2: Validation of the SHM model inference framework on synthetic data generated with the S5F model [16]. A. Inference ofcontext mutabilities γ w . B,C: Inference of position mutabilities β x for flat (B) and sinusoidal (C) profiles. Error bars correspondto 95% confidence intervals. The frequency at which a given position belongs to the CDR3 region is indicated with the greyshaded area. Mutabilities depend on both sequence context andposition
Confident that our procedure is able to infer rates re-liably, we next applied it to real data, consisting of theout-of-frame lineages from Ref. [7]. The inferred depen-dencies of mutability with context and position are pre-sented in Fig. 3. We represent context dependence usinga flat variant of the “hedgehog” plots used in Ref. [16],for A-, T- , C-, and G-centered motifs (Fig. 3A-D). Fullparameter tables are available at https://github.com/statbiophys/shmoof .Context dependent rates for A-centered motifs corre-spond well to the standard WA classification [26]: 76%of A-centered 5-mers with γ w > γ w (cid:46)
1. T-centeredmotifs are dominated by coldspots and their mutabili-ties do not align well with their corresponding reversecomplement counterparts. This is in agreement with theknown property of Polymerase η to be prone to errors atA nucleotides on the top strand [27].The C- and G-centered motifs have largely reverse-complement-symmetric rates (see Fig. S2). As previ-ously noted [16], this is in agreement with the strand-symmetric targeting of C/G-centered motifs by the AIDenzyme.The previously reported WRCY/RGYW motif [13,28] predicts high mutability reasonably well, while theSYC/GRW class of motifs [29] explains well a good frac-tion of coldspot motifs. Importantly, a large number ofhigh or low mutability 5-mers do not belong to any of thepreviously reported motifs (see Supplementary Tables 1and 2).The rugged profile of position dependence (Fig. 3E)shows clear enrichment in mutations in the CDR1 and CDR2 regions, reflected in the up to 2-fold increase ofthe position-dependent rates. Framework regions are lessmutated and we also observe a slight drop in the muta-bilities of the positions beyond the Cysteine anchor ofthe CDR3 region. We also learned models where the po-sition was defined from the 3’ end of the sequence in theJ segment (Fig. S3), yielding similar results but no clearimprovement over 5’-based position. High mutability ofCDR1 and CDR2 was already noted [30] and justifiedas an enrichment in highly mutable motifs (as quantifiedwith the S5F model). Our findings suggest that there isa secondary mechanism of this enrichment, having to doeither with accessibility of mutation-inducing enzymes ora superposition of context-dependent effects that evadethe assumption of independent evolution at different sitesand the limitation of 5-mer motifs.Note that introducing the explicit position dependencedoes affect the learning of the context-dependent param-eters: learning γ w with no position dependence (fixing β x = 1) yields similar but markedly different parametersthan when learning a free β x ( r = 81%, Fig. S4). Model is consistent across individuals and explainsdata better than previous approaches
To check the model’s generality, we estimated its vari-ability across individuals by computing Pearson’s cor-relation coefficient between the context ( γ w , Fig. 4A)and position ( β x , Fig. 4B) mutability profiles of differentdonors. The precision with which we can estimate modelparameters depends on the number of sequences used forinference, particularly for rare 5-mer contexts. Becausetwo individuals had many more reads than the others 7,we pooled together these seven individuals to make com- M u t a b ili t i e s AA C G T A C G T A C G T A C G T A C G T M u t a b ili t i e s TT G C A T G C A T G C A T G C A T G C A M u t a b ili t i e s CC G T A C G T A C G T A C G T A C G T A M u t a b ili t i e s GG C A T G C A T G C A T G C A T G C A T
100 200 300 400Position x . . . . . . . M u t a b ili t i e s Context dependence: w > w < x ABCDE
CDR1 CDR2 CDR3
FIG. 3: Context and position dependent model parameters. Context-dependent mutabilities γ w for A- (A),T- (B),C- (C) andG-centered (D) 5-mers. The colors indicate known hopspot and coldspot motifs. E. Position-dependent mutabilities β x . Grayshadings show the probability to be in the CDR1, CDR2, or CDR3 regions. Error bars correspond to 95% confidence intervals.See Fig. S5 for an full analysis of parameter uncertainty. parisons with similar dataset sizes (Fig. 4C). We thencompared the 2 individuals and 1 meta-individual witheach other and with a model learned on data from allindividuals. For the 2 individuals with the largest reper-toire datasets, the results are highly reproducible withPearson’s r = 78% for context and r = 70% for po- sition parameters (Fig. 4A), suggesting that the modelcaptures universal biochemical properties of the hyper-mutation process.To further validate the model’s accuracy, we comparedits prediction to data on the V-specific mutation profiles,which consist of the position-dependent mutation rate A ll O t h e r All326651326713Other 7
Context mutabilities A ll O t h e r All326651326713Other 7
Position mutabilities A ll O t h e r N u m b e r o f m u t a t i o n s . . . . . . r
100 150 200 250 300Position in the V gene, x IMGT . . . . . . F r a c t i o n o f m u t a t i o n s IGHV3-23ObservedExpected -
23 3 - - - - - - - -
74 4 - - -
51 1 - - -
13 6 - - - - - IGHV genes (20 most frequent)0 . . . . . P e a r s o n r . . . . . . . . . . F r a c t i o n o f m u t a t i o n s ObservedExpected Cys104 Try118Position in CDR30 . . . . . . F r a c t i o n o f m u t a t i o n s CDR3 length of 50 ntObservedExpected 20 40 60 80 100CDR3 length0 . . . . . . . . P e a r s o n r Mutabilities w S F m u t a b ili t i e s CDR1 FWR2 CDR2 FWR3 CDR3 FWR40 . . . . . . P e a r s o n r Models w x w x S5F
A B CEDF G HJI
CDR1 CDR2
FIG. 4: The model explains the data. Observed profiles were measured across the entire dataset used for model inference.See Fig. S6 for an equivalent figure when data was divided into training and testing sets. Reproducibility of parameters forindividual-specific models: context (A) and position (B) mutabilities. C. Number of mutations used for inference D. An examplemutation profile in the most common V gene. E. Model performance across V genes. F. Mutation profile in the FWR4 region.G. An example mutation profile in the CDR3 region for CDR3 length of 50 nts. H. Model performance across CDR3 lengths.I. Comparison with the S5F model. J. Summary of models performance across sequence regions. We compare the full context-and position-dependent model ( γ w β x ) with purely context- ( γ w ) and position- ( β x ) dependent models as well as with the S5Fdefault model. for each V segment. These rates result from the com-bined effect of position and context, but they are notfitted directly by the model. A typically good exam-ple of such a profile is shown in Fig. 4D. The predictionis generally excellent (Pearson’s r ∼ − γ w with noattempt to disentangle position dependence, so a directcomparison is subject to caution. Besides, S5F mutabil-ities are learned from synonymous mutations of produc-tive sequences, requiring extrapolation methods to coverall 1024 contexts, all of which do not occur with syn-onymous mutations. Yet, the two sets of mutabilities γ w correlate fairly well ( r = 36%, Fig. 4I). Correlationrises to r = 44% for contexts appearing in synonymousmutations, versus r = 18% for the other contexts forwhich S5F recourses to extrapolation, emphasizing thelimitations of that extrapolation.A summary of the performances of the different model-ing approaches on the mutabilities in the different regionsof the IgH gene is shown in Fig. 4J. We also checkedfor overfitting by dividing the dataset into a trainingand a testing, finding similar results (Fig. S6). The fullposition and context dependent model ( µ s,x = γ w β x )performs better than models with context or positionalone ( µ s,x = γ w and µ s,x = β x ). While the contextexplains the bulk of the mutation profile, adding po-sitional effects substantially boosts performance. Ourmodel clearly outperforms the S5F model, although itshould be reminded that S5F was trained on a distinctdataset. Re-training S5F on the productive sequencesfrom the present datasets using the procedure describedin the original article [16] actually yielded worse perfor-mance (data not shown), for reasons that are unclear tous. Overall, accounting for phylogeny and disentanglingthe combined effects of context and position allows ourmodel to accurately predict mutabilities including in thehyper-variable CDR3 region. Co-localization of mutations cannot be explained bycontext and position bias
It was previously observed that hypermutations tendto cluster along genomic position in nonproductive se-quences [22]. However, the origin of this phenomenonand its dependence on confounding factors such as phy-logeny and heterogeneous hot spot concentration werenot fully characterized.Clustering of mutations can be directly observed by plotting the fraction n ( r ) of pairs of mutations at dis-tance r from each other as a function r (normalized by thetotal number of pairs at that distance, see Fig. S7), whichis also called a spatial correlation function in physics. Fo-cusing on lineages with at least 6 leaves, and iteratingthrough all branches with fewer than 10 mutations, weevaluated this correlation function for pairs of mutationsoccurring in the same branch of the phylogeny versus dis-tant branches, as schematized in Fig. 5A). We then com-pared this correlation function to our model predictions(Fig. S8). The enrichment of closeby mutations can bequantified by the correlation function f ( r ) = n ( r ) /n m ( r ),where n ( r ), the fraction of pairs of mutations distant by r in the same tree branch is normalized by the modelprediction n m ( r ) (Fig. 5B).Pairs of mutations in distinct branches are well ex-plained by the model, suggesting that they are indepen-dent of each other, in agreement with the biological pic-ture that they occur at different rounds of affinity mat-uration. The enrichment of closeby mutations in distantbranches can be entirely explained by the clustering ofhotspot regions. Interestingly, both context and positiondependencies of the mutability are needed to explain thedata (Fig. S8). In contrast, pairs of mutations insidebranches tend to occur closer to each other than pre-dicted by the model. The enrichment of closeby muta-tions is up to five-fold, pointing to an additional mecha-nism causing hypermutation clustering. We observe thatthis enrichment persists in the presence of selection, asverified by computing the correlation function f ( r ) inproductive lineages (Fig. 5C). Minimal model of co-localization
To explain the observed excess of co-localized muta-tions, we propose a simple phenomenological model. Tar-geted mutations, following the context and position de-pendent profiles described so far, cause additional nearby‘follow-up’ mutations due to error-prone DNA repair.Given a substitution at x drawn from the same distri-bution as before, each position x (cid:54) = x can subsequentlymutate with probability p ( x | x ) = (cid:15) e −| x − x | /ξ , (3)where ξ is the correlation length and (cid:15) is small. The totalnumber of follow-up mutations is approximately Poissondistributed with mean (cid:80) x p ( x | x ) (cid:39) (cid:15)/ (1 − e − /ξ ). Tosimulate this process, we followed the same procedureas described earlier for synthetic data, but allowing forfollow-up, as well as targeted mutations, while keepingthe total number of mutations in each branch constant.We then computed the correlation function f ( r ), andcompared it to true profiles (Fig. 5B). We obtain a goodagreement for ξ = 10 and (cid:15) = 5% corresponding to anaverage of ∼ r C o rr e l a t i o n f un c t i o n f ( r ) Inside branchesBetween branchesModelof co-localization 1 25Distance r C o rr e l a t i o n f un c t i o n f ( r ) Inside branchesOut-of-frameProductive 326651Productive 326713
A B C
ACGTAGTCTTTACTGTGACACGTAGTGTTTCCTGTAAC
Pre-MRCAWithin selected branchMutationsOutsideFirst targetFollow-up x ⇠ µ ( x ) = w ( x ) x
12 are marked with dotted lines to guide the eye. Shaded areas represent 95% confidence intervals.
We asked whether this large number of non-targettedmutations may bias the inference of the targeting model,which assumes no follow-up mutations. To assess thiseffect, we re-inferred the rates { β x } and { γ w } from syn-thetic datasets simulated with ξ = 10, (cid:15) = 5%, withdata-inferred context profile γ w , and with data-inferredor flat position profiles β x (Fig. S9). We find that there-inferred mutabilities mostly agree with the true ones,with a slight shrinkage of values and enhanced muta-bilities of cold spots, owing to the equalizing effect co-localization. Importantly, co-localization does not in-troduce additional features in the re-inferred position-dependent profile, indicating that our inference proce-dure is robust to co-localization effects. III. DISCUSSION
The mutational landscape of antibody repertoires re-sults from many entangled effects, which are oftenlumped together into effective models of hypermutations[12, 16, 31]. First, hypermutations have intrinsic prefer-ences for certain positions along the IgH gene, regardlessof their impact on protein function. In addition, selectionfor antibody function, which includes protein stabilityand antigen affinity, favors beneficial mutations and sup-presses deleterious ones [13]. While intrinsic SHM pref-erences are believed to be universal, selective forces varyacross lineages which are involved in distinct immune re- sponses [21], and may also depend on the individual’simmune status [32]. Repertoire sequencing gives a snap-shot of a rapidly adapting population subjected to theseforces, making it hard to disentangle intrinsic SHM pref-erences from the combined effects of selection and geneticdrift. By focusing on non-productive lineages and using aphylogeny-based approach, we overcome the biases aris-ing from the dynamics of affinity maturation to obtain acomprehensive picture of SHM intrinsic preferences.Each hypermutation occurs through a series of eventsof DNA damage and repair. The action of each enzyme,including AID to error-prone DNA repair enzymes, mayeach have their own sequence preferences, and the in-terplay of these different biases results in the observedprofile. In our approach, these complex mutational path-ways are subsumed into an effective model with a limitednumber of interpretable parameters in terms of effectivecontext and position dependence. As a result, the contextdependent weights γ w do not simply reflect the bindingpreference of AID, but also account for the biases of otherbiochemical steps. Our framework enables direct mea-surement of the mutability γ w of a wide range of 5-mercontexts, recovering the known classifications of hot andcold spots [16, 28]. We show that our model outperformsexisting methods as well as purely context or positiondependent models in terms of explaining the data.The introduction of an explicit and universal positiondependence, β x , allows us to unveil an excess of muta-tions in the CDR1 and CDR2 regions. This enrichmentof mutations cannot be simply explained by their harbor-ing more hotspot contexts. We cannot exclude that thisresidual position dependence is due to more complex con-text effects missed by our model (based e.g. on 7-mers,which would be impractical to infer from the presentdataset). Alternatively, SHM may preferentially targetthese regions independently of their sequence context,possibly through epigenetic mechanisms. Such prefer-ence is known to exist at the genome-wide level to mu-tate the Ig loci without affecting other genes [13], so itis plausible that the same mechanism targets some spe-cific positions within Ig. The enrichment of mutationsin the CDR1 and CDR2 regions is even more markedin productive sequences, meaning that these mutationsare more likely to be selected during affinity maturation.This suggests that the intrinsically enhanced mutabilityof these regions may carry an evolutionary advantage, byfocusing hypermutations on regions where they are themost beneficial [30]. The stability of the immunoglobulinrelies on the FWR regions and most of the substitutionsare expected to be deleterious. The purifying nature ofselection in FWR regions has been quantified in Ref. [33]and contrasted with positive selection in CDR regions.By studying mutations along lineages, we were able tostudy mutations in the probable context in which theyoccurred, rather than relative to the germline sequence,allowing us to take into account the order of mutationsand to sample a broader diversity of 5-mer contexts. Thisapproach also allowed us to study and characterize hy-permutations in the CDR3, which has been neglectedin previous work [11] owing to the difficulty to separatethese mutations from junctional diversity.The phylogenetic methods employed in this study werenot specifically designed to study B cell repertoires. Inparticular the assumptions allowing for fast likelihoodcomputations do not account for the context dependenceof the mutation rate beyond the codon frame [23]. Theposition-dependent model introduced here could offer acompromise. While it does not account for the the fullcomplexity of SHM biases, it captures the variation of themutation rate observed in out-of-frame data well (Fig. 4),and can operate under the assumption of independentsite evolution. Our framework could also be easily ex-tended to include position-dependent selection in the nu-cleotide or amino acid representation.Our analysis confirmed a phenomenon of co-localization of mutations along the sequence. While thiseffect had been previously reported [22], here we showedthat it could not be explained by phylogenetic bias or theexistence of regions of higher and lower mutabilities. Weproposed a minimal quantitative model of hypermuta-tion targeting, followed by error-prone DNA repair thatcauses follow-up mutations, which explains the data well.While ideally we would like to infer the position and con-text mutability profiles taking these follow-up mutationsinto account, the task is impractical because it wouldrequire to identify the origin of each mutation. We ex-pect that doing so would only renormalize the values of the context preferences. While the adaptive advantageof co-localized mutations is unclear, we find the corre-lation function in productive lineages follows the unpro-ductive baseline with additional enrichment enhanced atmultiples of the codon length, 3, suggesting signaturesof selection (Fig. 5C). We speculate that nearby muta-tions occurring simultaneously could help cross barriersof positive sign epistasis, whereby two or more mutationsare deleterious by themselves, but beneficial together.This phenomenon could accelerate affinity maturation byfavoring compensatory or epistatic mutations at aminoacids that interact strongly within the antibody protein[34, 35].The obtained mutability models make predictionsabout the likelihood and plausibility of particular tra-jectories of affinity maturation. They could be useful indesigning vaccination strategy, by helping choose targetswith a greater potential for accumulating beneficial mu-tations towards antibodies with desired properties suchas neutralization power, or broadness in the case of fastevolving pathogens such as influenza or HIV [11, 36]. IV. METHODSData and preprocessing
We perform the analysis on recently published high-throughput RNA sequencing of Ig heavy genes at greatdepth [7].The sequences were barcoded with unique molecularidentifiers (UMI) to correct for the PCR amplificationbias. However, UMI cannot be used to correct sequenc-ing errors, as most UMI were represented by a singlesequence: the number of UMIs used is of the same or-der as the total number of cells in use. We aligned rawsequences using presto of the Immcantation pipeline [37]with setup allowing to correct for errors in UMIs and dealwith insufficient UMI diversity. The V region primerswere masked and the C region primers were used to dis-tinguish the two isotypes of sampled B cells: the IgMand IgG classes. The study of mutation profiles in thetwo groups revealed a much lower mutational load in theIgM cohort and hence a higher relative level of sequencingerrors, as well as shallower tree topologies. For furtheranalysis we chose to focus exclusively on the IgG class.Reads were filtered for quality and paired using defaultpresto parameters. Pre-processed data was then alignedto V, D and J templates from IMGT [38] database usingIgBlast [39]. In total there were 3 . × IgG sequencesper person (average 3 . × , median 1 . × ), of whichup to 2% were unproductive (average 5 . × , median2 . × ).0 Inference of evolutionary trajectories
Sequences with a frameshift in the CDR3 region werethen selected and used to reconstruct clonal families asfollows. In the first step, reads were aligned to the Vand J templates and grouped into classes of sequenceswith the same V and J gene assigned, as well as equalCDR3 length. In the out-of-frame classes we inferredclonal lineages by single linkage clustering with a thresh-old of 90% on CDR3 region identity [40]. We reconstructmaximum likelihood topologies, as well as the identity ofancestral states, under a simple K80 model of characterevolution [41] for all lineages comprising at least 6 uniquesequences. The model does not capture the complexity ofthe observed mutation profile, but avoids fitting multipleparameters independently in small lineages of relativelyshort alignment. The existing repertoire-wide method[23] is incompatible with out-of-frame lineages since itoperates on 61 productive codons. Ancestral states arefound through marginal reconstruction. Germline V andJ sequences were used as an outgroup to inform the phy-logenetic inference and root the lineage.
Model inference
With the exception of the initial branch, which joinsthe germline sequence and the most recent common an-cestor of the lineage, all branches shorter than 10 substi-tutions were used for model inference.Our task is to find a set of parameters { γ w } , { β x } thatmaximise the log-likelihood L = (cid:88) S,T ;( i,j ) ∈ T (cid:88) x (cid:48) | s i ( x (cid:48) ) (cid:54) = s j ( x (cid:48) ) ln (cid:16) e γ w ( si,x (cid:48) ) β x (cid:48) t − (cid:17) − (cid:88) x γ w ( s i ,x ) β x t (cid:35) , (4)where S is the set of sequences (observed and recon-structed) at each node of the tree, and T encodes the re-constructed phylogenetic tree through its branches ( i, j ),with reconstructed ancestral states s i and s j . The rates µ s,x = β x γ w ( s,x ) are defined so that the length of eachbranch t is expressed in terms of the expected numberof substitutions per basepair (total number of substitu-tions divided by the total alignment length). Imposing ∂ L /∂γ w = 0 yields an implicit expression for γ w as afunction of { β x } , but independent of { γ w (cid:48) } w (cid:48) (cid:54) = w , whichcan be solved by one-dimensional root finding. Likewise,setting ∂ L /∂β x = 0 gives an implicit expression for β x as a function of { γ w } . We can perform the followingiteration: γ n = arg max γ L ( γ, β n ) (5) β n +1 = arg max β L ( γ n , β ) , (6) which converges to the maximum of L with respect tothe joint { γ w , β x } .To estimate the uncertainty of inferred parameters wesample with replacement from the set of all branches tocreate 400 bootstrap copies. We report 95% confidenceintervals. Substitution models
Not only the targeting rate, but also the identity ofthe substitution is known to depend on the identity ofneighboring bases [16]. In our formulation of the model,inference of the targeting rates does not require knowingthe substitution type, however we can easily extend theframework to include this dependence. The probabilityof mutating from w to w (cid:48) over a period t can be expressedas P ( w → w (cid:48) | t ) = 1 − e − γ w ω w → w (cid:48) β x t , (7)where (cid:80) w (cid:48) ω w → w (cid:48) = 1 and ω w → w (cid:48) (cid:54) = 0 if w (cid:48) is a result ofa substitution at the central position of w . This way weadd 2 × = 2048 parameters to the model. We can inferthe maximum likelihood estimates of ω w → w (cid:48) using thesame iterative scheme introduced in the previous section. Synthetic datasets
We created synthetic datasets using the S5F model ofmutability (downloaded from clip.med.yale.edu/shm )for γ w . We used a flat profile, β x = 1 as well as sinusoidalprofiles ln β x = 2 sin( x/δ ) − β x = 2 cos( x/δ ) − δ = 50. For each branch ( i, j ), we compute themutability µ s i ,x as a function of x , and then introducemutations at n random positions picked without replace-ment according to µ x / (cid:80) x (cid:48) µ x (cid:48) , where n is the number ofmutations on the branch (fixed by the lineage structuretaken from the real data). Data availibility
All the data analyzed in this paper has been previouslypublished and can be accessed from original publications.Code for producing the figures of this paper, as well as theinferred model parameters, are freely available at https://github.com/statbiophys/shmoof . Acknowledgements
The study was supported by the European ResearchCouncil COG 724208. The authors are grateful for thediscussions and suggestions from Thomas Dupic, QuentinMarcou and Victor Chard`es.1 [1] Hozumi N, Tonegawa S (1976) Evidence for somatic re-arrangement of immunoglobulin genes coding for vari-able and constant regions.
Proceedings of the NationalAcademy of Sciences
Sci Transl Med
Proceedingsof the National Academy of Sciences { IgH } RepertoiresRevealed by Deep Sequencing.
The Journal of Immunol-ogy
Philos Trans R Soc Lond, B,Biol Sci
Plos One
Nature
Philos Trans R Soc Lond, B, BiolSci
Immunity
Jour-nal of immunology (Baltimore, Md. : 1950)
Cell
Frontiers in Immunology
Trendsin Immunology pp 1–15.[14] Weinstein JA, Jiang N, White RA, Fisher DS, QuakeSR (2009) High-throughput sequencing of the zebrafishantibody repertoire.
Science
Nat Biotechnol
Frontiers in Immunology
The Journal of Immunology
Philos Trans R SocLond, B, Biol Sci
Frontiers in Im-munology
Ge-netics
PLOS Computational Biology
Na-ture Communications
Proceedings ofthe National Academy of Sciences of the United States ofAmerica
Journal ofMolecular Evolution
Bioinformatics
Pro-ceedings of the National Academy of Sciences
Fron-tiers in Immunology
Sci-ence
Nature
Molecular Immunology
Journal of Autoimmunity
MolecularBiology and Evolution
Journal of Biological Chemistry
Cell Systems pp 1–8. [36] Liao HX, et al. (2013) Co-evolution of a broadly neutral-izing HIV-1 antibody and founder virus. Nature
Bioinformatics
Nucleic Acids Research
Nucleic Acids Research
Bioinformatics
Journal of MolecularEvolution Supplementary information True context mutabilities10 I n f e rr e d m u t a b ili t i e s S5F with = 1S5F with = x
100 200 300 400Position x I n f e rr e d m u t a b ili t i e s Recovered = 1True profile 100 200 300 400Position x Recovered = x Recovered = x True profiles
A B C
CDR3CDR3
FIG. S1: Model inference on synthetic data using true phylogenies. A. Inference of context mutabilities γ . B,C. Inference ofposition mutabilities β for flat and sinusoidal profiles. Error bars correspond to 95% confidence intervals. Frequency at whicha given position belongs to the CDR3 region is indicated with the grey shaded areas. Mutabilities w M u t a b ili t i e s ¯ w A/T-centered 5-mers10 Mutabilities w M u t a b ili t i e s ¯ w C/G-centered 5-mers
A B
FIG. S2: Strand asymmetry of the context-dependent rates. For each motif w we juxtapose its mutability γ w with the mutabilityof its reverse complement γ ¯ w . A. 5-mer motifs with strong central nucleotide (C/G), r = 54%. B. 5-mer motifs with weakcentral nucleotide (A/T), r = 7%
100 200 300 400Distance x from V5’ end0 . . . . . . . M u t a b ili t i e s
300 200 100 0Distance x r from J3’ end0 . . . . . . . . A B
CDR1 CDR2 CDR3 CDR1 CDR2 CDR3
FIG. S3: Alternative position definition. We compare the models based on different position definitions: x , the distance fromthe 5 (cid:48) end of the V gene (A, red) and x r , distance from the 3 (cid:48) end of the J gene (B, green). Error bars correspond to 95%confidence intervals. Frequency at which a given position belongs to a CDR region is indicated with the grey shaded areas. Mutabilities ( = 1)10 M u t a b ili t i e s ( = x ) = 1 versus = x FIG. S4: Introducing explicit position dependence β (cid:54) = 1 influences the context-dependent rates. We compare the γ mutabilitiesfrom the full model with the parameters of the purely context-dependent model. w sorted by M u t a b ili t i e s x sorted by M u t a b ili t i e s A B
FIG. S5: Analysis of the uncertainty of estimate parameters γ (A) and β (B). The shaded area indicates the 95% confidenceinterval envelope. Motifs and positions are sorted by their respective mutabilities
100 150 200 250 300Position in the V gene, x IMGT . . . . . . F r a c t i o n o f m u t a t i o n s IGHV3-23ObservedExpected -
23 3 - - - - - - - -
74 1 - - - - - - -
69 6 - - -
53 3 - IGHV genes (20 most frequent)0 . . . . . . . . . P e a r s o n r . . . . . . . . . . F r a c t i o n o f m u t a t i o n s ObservedExpected Cys104 Try118Position in CDR30 . . . . . . F r a c t i o n o f m u t a t i o n s CDR3 length of 50 ntObservedExpected 20 40 60 80 100CDR3 length0 . . . . . . . P e a r s o n r Mutabilities w S F m u t a b ili t i e s CDR1 FWR2 CDR2 FWR3 CDR3 FWR40 . . . . . . P e a r s o n r Models w x w x S5F
A BC D EGF
CDR1 CDR2
FIG. S6: Model performance when model was trained on 2/3 of data, and tested on the remaining 1/3. A. Example mutationprofile in the most common V gene. B. Model performance across V genes. C. Mutation profile in the FWR4 region. D.Example mutation profile in the CDR3 region for CDR3 length of 50 nts. E. Model performance across CDR3 lengths. F.Comparison with the S5F model. G. Summary of models performance across sequence regions. r F r a c t i o n o f p a i r s Inside branchesBetween branchesModel w x FIG. S7: Fraction of pairs n ( r ) of mutations encoded in the same tree branches (red) and pairs of mutations between differentbranches (green) normalized by the total number of pairs at that distance, ( l − r ) / (cid:0) l (cid:1) , where l stands for the alignment length.Shaded area corresponds to 95% confidence interval. r F r a c t i o n o f p a i r s Between branchesModel w x Model x Model w FIG. S8: Normalized fraction n ( r ) of pairs of mutations between different branches (green). Context- and position-dependentmodels prediction n m ( r ) for this quantity are presented for the full ( µ s,x = γ w β x ), purely context-dependent ( µ s,x = γ w ) andpurely position-dependent ( µ s,x = β x ) models. Shaded area corresponds to 95% confidence interval. True context mutabilities10 I n f e rr e d m u t a b ili t i e s with = 1 0 100 200 300 400Position x I n f e rr e d m u t a b ili t i e s Recovered = 1True profile10 True context mutabilities10 I n f e rr e d m u t a b ili t i e s with = ( x ) 100 200 300 400Position x I n f e rr e d m u t a b ili t i e s Recovered = ( x )True profile A BC D
FIG. S9: Model inference on synthetic data with correlated mutations using true phylogenies. Mutations were drawn accordingto the co-localization model (3) with (cid:15) = 5% and ξ = 10 with data-derived context and position dependent mutabilities. A,C.Inference of context mutabilities γ . B,D. Inference of position mutabilities ββ