[PDF] Learning the heterogeneous hypermutation landscape of immunoglobulins from high-throughput repertoire data

Abstract

Somatic hypermutations of immunoglobulin (Ig) genes occuring during affinity maturation drive B-cell receptors' ability to evolve strong binding to their antigenic targets. The landscape of these mutations is highly heterogeneous, with certain regions of the Ig gene being preferentially targeted. However, a rigorous quantification of this bias has been difficult because of phylogenetic correlations between sequences and the interference of selective forces. Here, we present an approach that corrects for these issues, and use it to learn a model of hypermutation preferences from a recently published large IgH repertoire dataset. The obtained model predicts mutation profiles accurately and in a reproducible way, including in the previously uncharacterized Complementarity Determining Region 3, revealing that both the sequence context of the mutation and its absolute position along the gene are important. In addition, we show that hypermutations occurring concomittantly along B-cell lineages tend to co-localize, suggesting a possible mechanism for accelerating affinity maturation.

Full PDF

LLearning the heterogeneous hypermutation landscape of immunoglobulins fromhigh-throughput repertoire data

Natanael Spisak, Aleksandra M. Walczak, ∗ and Thierry Mora ∗ Laboratoire de physique de l’ ´Ecole normale sup´erieure, CNRS, PSL University,Sorbonne Universit´e, and Universit´e de Paris, 75005 Paris, France

Somatic hypermutations of immunoglobulin (Ig) genes occurring during aﬃnity maturation driveB-cell receptors’ ability to evolve strong binding to their antigenic targets. The landscape of thesemutations is highly heterogeneous, with certain regions of the Ig gene being preferentially targeted.However, a rigorous quantiﬁcation of this bias has been diﬃcult because of phylogenetic correlationsbetween sequences and the interference of selective forces. Here, we present an approach that correctsfor these issues, and use it to learn a model of hypermutation preferences from a recently publishedlarge IgH repertoire dataset. The obtained model predicts mutation proﬁles accurately and in areproducible way, including in the previously uncharacterized Complementarity Determining Region3, revealing that both the sequence context of the mutation and its absolute position along the geneare important. In addition, we show that hypermutations occurring concomittantly along B-celllineages tend to co-localize, suggesting a possible mechanism for accelerating aﬃnity maturation.

I. INTRODUCTION

B cells are a crucial player in the adaptive immunesystem. Swift eradication of pathogens is enabled by theproduction of immunoglobulins (Ig) that bind tightly toantigens, helping in their detection, neutralization, andremoval. Achieving high accuracy and breadth relies onthe extraordinary diversity of the B cells repertoire. Theprocess of V(D)J recombination results in a highly di-verse population of naive cells [1–7]. In addition, B cellsundergo aﬃnity maturation, a Darwinian process [8] inwhich mutations are introduced to the immunoglobulin-coding gene and highest aﬃnity mutants are selected [9].This process is driven by a very high rate of somatichypermutations (SHM), ∼ − per basepair per cell di-vision [10], targeting the Ig genes. Some receptor genescan ultimately accumulate up to 30% amino acid substi-tutions, considerably altering the initial genotype. Thebroad diversity created by SHM ultimately ensures theemergence and selection of strong antigen binders. Un-derstanding SHM and their statistics is key to designingbetter vaccination strategies [11, 12].Like the VDJ recombination process, SHM are charac-terized by heterogeneous preferences. Mutational path-ways aﬀect the Ig genes unevenly, with ‘cold’ and ‘hot’spots along the receptor gene, even before somatic se-lection introduces further biases [12]. SHM is initi-ated by Activation-Induced cytidine Deaminase (AID)through the deamination of deoxycytidines triggering anarray of error-prone repair pathways [13]. AID and re-pair enzymes preferentially target certain regions of thegene. However, a quantitative picture of how these pro-cesses and their context dependencies result in the ob-served heterogeneous mutational landscape is lacking.High-throughput repertoire sequencing of the Ig gene ∗ Corresponding authors. These authors contributed equally. [2, 3, 14, 15] has facilitated the development of eﬀectivemodels from a detailed analysis of mutational proﬁles ofIg sequences before [5, 16, 17] or after selection [18–22].However, the spatial organization of mutations, their con-text preferences, and their interplay with selection duringaﬃnity maturation are still poorly understood, in partdue to a number of confounding factors.A fundamental issue is the bias of selection, whichfavors beneﬁcial mutations over deleterious ones in theobserved repertoire. This bias can be partially circum-vented by analyzing synonymous substitutions [16], withthe limitation that extrapolation is required to generalizeto non-synonymous ones. Another way around selectionis to study passenger nonproductive sequences, which areunsuccessful products of VDJ recombination and thusunaﬀected by selection [5, 17, 22]. These sequences makeup a minority of DNA sequences, and are rarely foundin mRNA sequences because of allelic exclusion, whichlimits their use to very large datasets.Another confounding factor arises from phylogeneticbiases due to the complex multi-lineage structure of therepertoire. While methods have been developed to infersubstitution rates from lineages in a lineage-speciﬁc [21]or repertoire-wide way [23], they do not aim to correctfor selection and do not address the question of hyper-mutation targeting.Here we propose a new framework for quantifyingand predicting immunoglobulin mutability. The modelis trained on the reconstructed phylogenies of nonpro-ductive lineages from very large published B cell reper-toires totalling around half a million nonproductive se-quences [7], allowing us to overcome previous limitationsof dataset sizes. The approach accounts for both phyloge-netic and selection biases, and allows us to study in detailthe spatial and context preferences of hypermutation tar-geting, and to reveal the co-localization of contemporarymutations. a r X i v : . [ q - b i o . GN ] S e p II. RESULTSRepertoire-wide framework to model intrinsicmutabilities from out-of-frame lineages

Out-of-frame Ig sequences are byproducts of the VDJrecombination process that are made non functional bya frameshift in the CDR3 region. Since each cell has twocopies of the Ig genes, out-of-frame rearrangements maysurvive in the cell if recombination on the second chro-mosome is successful. The mechanism of allelic exclusionensures that only the functional variant is expressed. Yet,out-of-frame IgH sequences comprise ∼

2% of rearrange-ments in Ig mRNA sequencing experiments, and ∼

9% ingenomic DNA [6]. When a B-cell clone harboring bothan out-of-frame and a functional rearrangement under-goes aﬃnity maturation, the out-of-frame sequence actsas a passenger and mutates alongside the functional se-quence, with the selection pressure acting only on thelatter. While the two sequences share the same phy-logeny, mutations found in out-of-frame lineages are notexpected to be subject to selection.To model the process of SHM, we reconstruct the evo-lutionary history giving rise to the observed mutationpatterns in nonproductive rearrangements. We analyseddata consisting of the IgG repertoires of 9 individualsfrom Ref. [7], obtained by the targeted mRNA sequencingof the Ig heavy (IgH) chain locus. We pre-processed andaligned raw IgH sequences to keep only out-of-frame se-quences. We then grouped sequences into clonal familiesthat originate from the same ancestor using single linkageclustering (Fig. 1A). The size of clonal families typicallyfollows a power-law distribution (Fig. 1C). As a result,many lineages are represented by one or a few sequences.We focused on suﬃciently large lineages (comprised of atleast 6 distinct sequences) and reconstructed their lineagestructure, using maximum likelihood [24, 25] to infer thetopology of the underlying tree, and marginal reconstruc-tion for the identity of ancestral states. This provides uswith a list > ,

000 mutation events occurring betweenthe most recent common ancestor of lineages and theirleaves.Using lineage information is essential for multiple rea-sons. First, it allows for a better estimate of the sequencecontext in which a mutation appears. In this paper wedeﬁne the context as the 5-mer sequence comprising themutated basepair ﬂanked by 2 basepairs on each side.In the absence of lineage information, the best guess forthe 5-mer context would be given by the genomic se-quence of the V, D, or J segment where the mutationarose. But that context may itself be aﬀected by otherprior mutations. The tree structure allows us to identifythe order of mutations and reconstruct the probable 5-mer context in which each mutation occurred. Second,for the same reason, the tree structure can help identifymutations in the hyper-variable CDR3 region, includingin the junctions made of nontemplated insertions. Thismakes it possible to estimate the hypermutation rate in these regions. Together, these improvements mean thatmutations can be identiﬁed within a broader range of 5-mer contexts, and their corresponding mutabilities betterestimated. Third, lineage structure helps reduce contam-ination from sequences that have been under some selec-tion. In some rare events, during aﬃnity maturation asomatic insertion or deletion may be introduced in theCDR3 of a productive sequence, which would lead us toclassify it as out-of-frame, even though it has been sub-ject to selection prior to the frame-shift event. Focusingon mutations happening downstream of the most recentcommon ancestor, which is already out of frame, help usdiscards those contaminating events.Given a model P ( s → s (cid:48) | t, θ ) of sequence evolutionfrom s to s (cid:48) , where t is fraction of mutated positionsbetween s and s (cid:48) , (called branch length, equal to thenumber of mutations divided by alignment length), and θ denotes model parameters, we can write the joint like-lihood of mutation events in each lineage as P ( S | T, θ ) = (cid:89) ( i,j ) ∈ T P ( s i → s j | t j , θ ) , (1)where S is the set of sequences (observed and recon-structed) at each node of the tree, and T encodes the re-constructed phylogenetic tree through its branches ( i, j ).We assume every position x of the sequence s evolvesindependently inside each branch. Mutations occur ac-cording to a set of Poisson clocks with sequence- andposition-dependent rates, µ s,x , expressed per unit timeof branch length. During t some positions will mutateand others will remain unchanged, so that P ( s → s (cid:48) | t ) = (cid:89) x | s ( x ) (cid:54) = s (cid:48) ( x ) (1 − e − µ s,x t ) (cid:89) x | s ( x )= s (cid:48) ( x ) e − µ s,x t . (2)We assume that mutability depends independently on thelocal 5-mer sequence context centered around the muta-tion, w ( s, x ) = ( s ( x − , . . . , s ( x + 2)), and on the abso-lute position x along the gene (measured as the distancefrom the 5 (cid:48) end of the gene), so that µ s,x = γ w ( s,x ) β x . Inabsence of context and position dependence, we wouldhave µ = 1 by construction. Thus values of γ w or β x above 1 imply higher mutabilities than average, and viceversa for values below 1. To lift the degeneracy in overallscale between β x and γ w , we impose (cid:104) β x (cid:105) = 1.Overall, the model has 4 = 1024 parameters for γ w corresponding to each 5-mer, and L = 400 parametersfor β x corresponding to each possible position. We inferthese parameters from repertoire-wide sequencing databy maximizing the total log-likelihood of mutations in allbranches in all lineages, L ( β, γ ) = (cid:80) ( S,T ) ln P ( S | T, β, γ ),with respect to ( β, γ ), using an iterative procedure.

Validation on synthetic data

We ﬁrst tested the ability of the inference frameworkto recover true mutability parameters using synthetic Clonal family size10 D i s tr i bu t i o n . . . . . . . D i s tr i bu t i o n Donors316188326650326651326713326737326780326797326907327059

A B C DE

Pre-MRCAPost-MRCAMutations

ACGTGACCGTAGTCTTTACTGTGACACGTGAACGTAGTCTTTACTGTGAC w ( s, x ) AAAB7XicbVDLSgNBEOyNrxhf8XHzMhiECBJ2o6DHgBePEcwDkiXMTmaTMbMzy8ysGpb8gxcPinj1f7z5N042HjSxoKGo6qa7K4g508Z1v5zc0vLK6lp+vbCxubW9U9zda2qZKEIbRHKp2gHWlDNBG4YZTtuxojgKOG0Fo6up37qnSjMpbs04pn6EB4KFjGBjpeZDWZ8+nvSKJbfiZkCLxPshpdpBmKHeK352+5IkERWGcKx1x3Nj46dYGUY4nRS6iaYxJiM8oB1LBY6o9tPs2gk6tkofhVLZEgZl6u+JFEdaj6PAdkbYDPW8NxX/8zqJCS/9lIk4MVSQ2aIw4chINH0d9ZmixPCxJZgoZm9FZIgVJsYGVLAhePMvL5JmteKdVao3No1zmCEPh3AEZfDgAmpwDXVoAIE7eIIXeHWk8+y8Oe+z1pzzM7MPf+B8fANnhJE/ µ s,x = w ( s,x ) x AAACC3icbVDLSgNBEJyNrxhfUY96GBKECBJ2E0EvQsCLRwVjAtmw9E4myZCZ3WVmVg1L7l78A09+gBcFRbz6A570b5w8DhotaCiquunu8iPOlLbtLys1Mzs3v5BezCwtr6yuZdc3LlQYS0KrJOShrPugKGcBrWqmOa1HkoLwOa35veOhX7ukUrEwONf9iDYFdALWZgS0kbxszhWxl6i96wE+wm4HhAAvuSoYYXfg+lSDd+1l83bRHgH/Jc6E5CvbnwdP9+W7Uy/74bZCEgsaaMJBqYZjR7qZgNSMcDrIuLGiEZAedGjD0AAEVc1k9MsA7xilhduhNBVoPFJ/TiQglOoL33QK0F017Q3F/7xGrNuHzYQFUaxpQMaL2jHHOsTDYHCLSUo07xsCRDJzKyZdkEC0iS9jQnCmX/5LLkpFp1wsnZk09tEYabSFcqiAHHSAKugEnaIqIugGPaBn9GLdWo/Wq/U2bk1Zk5lN9AvW+zcKS54O x AAAB6HicbZDLSgNBEEVr4ivGV9Slm8YguAozUdCdATcuEzAPSIbQ06lJ2vQ86O4Rw5AvcONCkbj0L/wNd/6NPUkWGr3QcLi3iq4qLxZcadv+snIrq2vrG/nNwtb2zu5ecf+gqaJEMmywSESy7VGFgofY0FwLbMcSaeAJbHmj6yxv3aNUPApv9ThGN6CDkPucUW2s+kOvWLLL9kzkLzgLKF19TDO91XrFz24/YkmAoWaCKtVx7Fi7KZWaM4GTQjdRGFM2ogPsGAxpgMpNZ4NOyIlx+sSPpHmhJjP3Z0dKA6XGgWcqA6qHajnLzP+yTqL9SzflYZxoDNn8Iz8RREck25r0uUSmxdgAZZKbWQkbUkmZNrcpmCM4yyv/hWal7JyVK3W7VD2HufJwBMdwCg5cQBVuoAYNYIDwCM/wYt1ZT9arNZ2X5qxFzyH8kvX+DQP7kX0= s AAAB6HicbZDLSsNAFIZP6q3WW9Wlm2ARXJWkCrqz4MZlC/YCbSiT6Uk7djIJMxOhhD6BGxeK1KVv4Wu4822ctF1o6w8DH/9/DnPO8WPOlHacbyu3tr6xuZXfLuzs7u0fFA+PmipKJMUGjXgk2z5RyJnAhmaaYzuWSEKfY8sf3WZ56xGlYpG41+MYvZAMBAsYJdpYddUrlpyyM5O9Cu4CSjef00zvtV7xq9uPaBKi0JQTpTquE2svJVIzynFS6CYKY0JHZIAdg4KEqLx0NujEPjNO3w4iaZ7Q9sz93ZGSUKlx6JvKkOihWs4y87+sk+jg2kuZiBONgs4/ChJu68jOtrb7TCLVfGyAUMnMrDYdEkmoNrcpmCO4yyuvQrNSdi/KlbpTql7CXHk4gVM4BxeuoAp3UIMGUEB4ghd4tR6sZ+vNms5Lc9ai5xj+yPr4AfxYkXg= s AAAB6XicbZDLSgMxFIbP1Futt6pLN8EiuiozVdCdBTcuq9gLtEPJpJk2NJMMSUYoQ9/AjQtFu/UpfA13vo2Ztgtt/SHw8f/nkHNOEHOmjet+O7mV1bX1jfxmYWt7Z3evuH/Q0DJRhNaJ5FK1AqwpZ4LWDTOctmJFcRRw2gyGN1nefKRKMykezCimfoT7goWMYGOte33aLZbcsjsVWgZvDqXrz/dMk1q3+NXpSZJEVBjCsdZtz42Nn2JlGOF0XOgkmsaYDHGfti0KHFHtp9NJx+jEOj0USmWfMGjq/u5IcaT1KApsZYTNQC9mmflf1k5MeOWnTMSJoYLMPgoTjoxE2dqoxxQlho8sYKKYnRWRAVaYGHucgj2Ct7jyMjQqZe+8XLlzS9ULmCkPR3AMZ+DBJVThFmpQBwIhPMELvDpD59l5cyaz0pwz7zmEP3I+fgBc2pGp t AAAB6HicbZDLSsNAFIZP6q3WW9Wlm2ARXJWkCrqz4MZlC/YCbSiT6aQdO5mEmROhhD6BGxeK1KVv4Wu4822ctF1o6w8DH/9/DnPO8WPBNTrOt5VbW9/Y3MpvF3Z29/YPiodHTR0lirIGjUSk2j7RTHDJGshRsHasGAl9wVr+6DbLW49MaR7JexzHzAvJQPKAU4LGqmOvWHLKzkz2KrgLKN18TjO913rFr24/oknIJFJBtO64ToxeShRyKtik0E00iwkdkQHrGJQkZNpLZ4NO7DPj9O0gUuZJtGfu746UhFqPQ99UhgSHejnLzP+yToLBtZdyGSfIJJ1/FCTCxsjOtrb7XDGKYmyAUMXNrDYdEkUomtsUzBHc5ZVXoVkpuxflSt0pVS9hrjycwCmcgwtXUIU7qEEDKDB4ghd4tR6sZ+vNms5Lc9ai5xj+yPr4Af3ckXk= i AAAB6HicbZDLSsNAFIZP6q3WW9Wlm2ARXJWkCrqz4MZlC/YCbSiT6Uk7djIJMxOhhD6BGxeK1KVv4Wu4822ctF1o6w8DH/9/DnPO8WPOlHacbyu3tr6xuZXfLuzs7u0fFA+PmipKJMUGjXgk2z5RyJnAhmaaYzuWSEKfY8sf3WZ56xGlYpG41+MYvZAMBAsYJdpYddYrlpyyM5O9Cu4CSjef00zvtV7xq9uPaBKi0JQTpTquE2svJVIzynFS6CYKY0JHZIAdg4KEqLx0NujEPjNO3w4iaZ7Q9sz93ZGSUKlx6JvKkOihWs4y87+sk+jg2kuZiBONgs4/ChJu68jOtrb7TCLVfGyAUMnMrDYdEkmoNrcpmCO4yyuvQrNSdi/KlbpTql7CXHk4gVM4BxeuoAp3UIMGUEB4ghd4tR6sZ+vNms5Lc9ai5xj+yPr4Ae0wkW4= j AAAB6HicbVDJSgNBEK1xjeOSqEcvjUHQS5iJgt4MiOAxAbNAMoSeTk3SSc9Cd48QQr7AiwdFvPob/oU3P8G/sLMcNPFBweO9Kqrq+YngSjvOl7Wyura+sZnZsrd3dveyuf2DmopTybDKYhHLhk8VCh5hVXMtsJFIpKEvsO4PbiZ+/QGl4nF0r4cJeiHtRjzgjGojVfrtXN4pOFOQZeLOSb6U/b4+sz9uy+3cZ6sTszTESDNBlWq6TqK9EZWaM4Fju5UqTCgb0C42DY1oiMobTQ8dkxOjdEgQS1ORJlP198SIhkoNQ990hlT31KI3Ef/zmqkOrrwRj5JUY8Rmi4JUEB2TydekwyUyLYaGUCa5uZWwHpWUaZONbUJwF19eJrViwT0vFCsmjQuYIQNHcAyn4MIllOAOylAFBgiP8AwvVt96sl6tt1nrijWfOYQ/sN5/AGJ0j4o= t j AAAB6nicbVDLSgNBEOyNrxhfUQ8evCwGwVPYjYIeA170FtE8IFni7GQ2GTM7u8z0CmHJJ3jxoIhX/8H/8OYPePQbnDwOmljQUFR1093lx4JrdJxPK7OwuLS8kl3Nra1vbG7lt3dqOkoUZVUaiUg1fKKZ4JJVkaNgjVgxEvqC1f3++civ3zOleSRvcBAzLyRdyQNOCRrpGtt37XzBKTpj2PPEnZJCee/y6/s9c1tp5z9anYgmIZNIBdG66ToxeilRyKlgw1wr0SwmtE+6rGmoJCHTXjo+dWgfGqVjB5EyJdEeq78nUhJqPQh90xkS7OlZbyT+5zUTDM68lMs4QSbpZFGQCBsje/S33eGKURQDQwhV3Nxq0x5RhKJJJ2dCcGdfnie1UtE9LpauTBonMEEW9uEAjsCFUyjDBVSgChS68ABP8GwJ69F6sV4nrRlrOrMLf2C9/QAeeZFU FIG. 1: Repertoire-wide framework to model somatic hypermutations (SHM) in out-of-frame lineages. A. Examples of out-of-frame clonal families. B. An example of a phylogeny with marked pre- and post-Most Recent Common Ancestor (MRCA)mutations. Only post-MRCA mutations are used for learning the somatic hypermutation model. C. Clonal family size distri-bution. Phylogenies were inferred for families with more than 5 unique sequences. D. Branch length distribution. Mutationsencoded in branches shorter than 10 mutations were used for model inference. E. Context and position dependence of themutation rate across a sequence. In this example the underlined C mutates to A, in the GACCG context denoted as w ( s, x ).The mutation rate µ s,x depends both on the sequence s and the position x , through the context dependent rate γ w ( s,x ) andthe position dependent rate β x . datasets. Synthetic data was designed to mimic as closelyas possible the features of the real repertoire data to beanalyzed. We used tree topologies inferred on out-of-frame lineages from 9 individuals of Ref. [7]. The se-quence at the root of each tree was replaced by a ran-dom sequence drawn using IGoR, a model of stochasticVDJ recombination [22]. Random mutations were thenintroduced along the tree structure, following the samenumber of mutations on each branch as in the original lin-eage, and according to the SHM model (Eq. 2). Context-dependent parameters γ w were set to the previously pub-lished S5F model [16] , and three variants of the positiondependent β x were tested: ﬂat, and two sinusoidal pro-ﬁles (see Methods). Finally we collected sequences at theleaves of the trees into a synthetic dataset.Starting from this dataset, we performed alignment,clonal family inference, tree reconstruction and ﬁnallymodel inference using the exact same procedure as forreal data. We compared parameters inferred this wayto the true values of γ w and β x (Fig. 2). We were ableto recover these rates with excellent accuracy (Pearson’s r = 97% for both γ and β ).The fact that the procedure recovers the correctposition-dependent proﬁle β x , including a ﬂat one(Fig. 2B), shows that the framework successfully cor- rects for the two following confounding factors. First,sequence conservation across the diﬀerent V, D, and Jsegments means that context and position are often in-tertwined, making the extraction of each dependence dif-ﬁcult. Second, high variability in the CDR3 may causeerrors in the assignment of sequences into clonal families,and makes it harder to reliably call mutations than in thegermline regions. This remains true in the presence oflarge variations of the mutability along the position, in-cluding in the CDR3, as demonstrated on the sinusoidalproﬁles (Fig. 2C). On the other hand, the possibility touse the CDR3 sequence for model inference gives accessto a more diverse range of possible contexts, leading tobetter estimates for contexts that are underrepresentedin the germline genes.To assess the impact of errors in the reconstruction ofclonal families and lineages on the inferred parameters,we repeated the procedure using the true tree topologiesinstead of the reconstructed ones. This only modestlyimproved accuracy ( r = 98%, see Fig. S1), suggestingthat the procedure is robust to lineage misassignments. True context mutabilities10 I n f e rr e d m u t a b ili t i e s S5F with = 1S5F with = x

100 200 300 400Position x I n f e rr e d m u t a b ili t i e s Recovered = 1True proﬁle 100 200 300 400Position x Recovered = x Recovered = x True proﬁles

A B C

CDR3 CDR3

FIG. 2: Validation of the SHM model inference framework on synthetic data generated with the S5F model [16]. A. Inference ofcontext mutabilities γ w . B,C: Inference of position mutabilities β x for ﬂat (B) and sinusoidal (C) proﬁles. Error bars correspondto 95% conﬁdence intervals. The frequency at which a given position belongs to the CDR3 region is indicated with the greyshaded area. Mutabilities depend on both sequence context andposition

Conﬁdent that our procedure is able to infer rates re-liably, we next applied it to real data, consisting of theout-of-frame lineages from Ref. [7]. The inferred depen-dencies of mutability with context and position are pre-sented in Fig. 3. We represent context dependence usinga ﬂat variant of the “hedgehog” plots used in Ref. [16],for A-, T- , C-, and G-centered motifs (Fig. 3A-D). Fullparameter tables are available at https://github.com/statbiophys/shmoof .Context dependent rates for A-centered motifs corre-spond well to the standard WA classiﬁcation [26]: 76%of A-centered 5-mers with γ w > γ w (cid:46)

1. T-centeredmotifs are dominated by coldspots and their mutabili-ties do not align well with their corresponding reversecomplement counterparts. This is in agreement with theknown property of Polymerase η to be prone to errors atA nucleotides on the top strand [27].The C- and G-centered motifs have largely reverse-complement-symmetric rates (see Fig. S2). As previ-ously noted [16], this is in agreement with the strand-symmetric targeting of C/G-centered motifs by the AIDenzyme.The previously reported WRCY/RGYW motif [13,28] predicts high mutability reasonably well, while theSYC/GRW class of motifs [29] explains well a good frac-tion of coldspot motifs. Importantly, a large number ofhigh or low mutability 5-mers do not belong to any of thepreviously reported motifs (see Supplementary Tables 1and 2).The rugged proﬁle of position dependence (Fig. 3E)shows clear enrichment in mutations in the CDR1 and CDR2 regions, reﬂected in the up to 2-fold increase ofthe position-dependent rates. Framework regions are lessmutated and we also observe a slight drop in the muta-bilities of the positions beyond the Cysteine anchor ofthe CDR3 region. We also learned models where the po-sition was deﬁned from the 3’ end of the sequence in theJ segment (Fig. S3), yielding similar results but no clearimprovement over 5’-based position. High mutability ofCDR1 and CDR2 was already noted [30] and justiﬁedas an enrichment in highly mutable motifs (as quantiﬁedwith the S5F model). Our ﬁndings suggest that there isa secondary mechanism of this enrichment, having to doeither with accessibility of mutation-inducing enzymes ora superposition of context-dependent eﬀects that evadethe assumption of independent evolution at diﬀerent sitesand the limitation of 5-mer motifs.Note that introducing the explicit position dependencedoes aﬀect the learning of the context-dependent param-eters: learning γ w with no position dependence (ﬁxing β x = 1) yields similar but markedly diﬀerent parametersthan when learning a free β x ( r = 81%, Fig. S4). Model is consistent across individuals and explainsdata better than previous approaches

To check the model’s generality, we estimated its vari-ability across individuals by computing Pearson’s cor-relation coeﬃcient between the context ( γ w , Fig. 4A)and position ( β x , Fig. 4B) mutability proﬁles of diﬀerentdonors. The precision with which we can estimate modelparameters depends on the number of sequences used forinference, particularly for rare 5-mer contexts. Becausetwo individuals had many more reads than the others 7,we pooled together these seven individuals to make com- M u t a b ili t i e s AA C G T A C G T A C G T A C G T A C G T M u t a b ili t i e s TT G C A T G C A T G C A T G C A T G C A M u t a b ili t i e s CC G T A C G T A C G T A C G T A C G T A M u t a b ili t i e s GG C A T G C A T G C A T G C A T G C A T

100 200 300 400Position x . . . . . . . M u t a b ili t i e s Context dependence: w > w < x ABCDE

CDR1 CDR2 CDR3

FIG. 3: Context and position dependent model parameters. Context-dependent mutabilities γ w for A- (A),T- (B),C- (C) andG-centered (D) 5-mers. The colors indicate known hopspot and coldspot motifs. E. Position-dependent mutabilities β x . Grayshadings show the probability to be in the CDR1, CDR2, or CDR3 regions. Error bars correspond to 95% conﬁdence intervals.See Fig. S5 for an full analysis of parameter uncertainty. parisons with similar dataset sizes (Fig. 4C). We thencompared the 2 individuals and 1 meta-individual witheach other and with a model learned on data from allindividuals. For the 2 individuals with the largest reper-toire datasets, the results are highly reproducible withPearson’s r = 78% for context and r = 70% for po- sition parameters (Fig. 4A), suggesting that the modelcaptures universal biochemical properties of the hyper-mutation process.To further validate the model’s accuracy, we comparedits prediction to data on the V-speciﬁc mutation proﬁles,which consist of the position-dependent mutation rate A ll O t h e r All326651326713Other 7

Context mutabilities A ll O t h e r All326651326713Other 7

Position mutabilities A ll O t h e r N u m b e r o f m u t a t i o n s . . . . . . r

100 150 200 250 300Position in the V gene, x IMGT . . . . . . F r a c t i o n o f m u t a t i o n s IGHV3-23ObservedExpected -

23 3 - - - - - - - -

74 4 - - -

51 1 - - -

13 6 - - - - - IGHV genes (20 most frequent)0 . . . . . P e a r s o n r . . . . . . . . . . F r a c t i o n o f m u t a t i o n s ObservedExpected Cys104 Try118Position in CDR30 . . . . . . F r a c t i o n o f m u t a t i o n s CDR3 length of 50 ntObservedExpected 20 40 60 80 100CDR3 length0 . . . . . . . . P e a r s o n r Mutabilities w S F m u t a b ili t i e s CDR1 FWR2 CDR2 FWR3 CDR3 FWR40 . . . . . . P e a r s o n r Models w x w x S5F

A B CEDF G HJI

CDR1 CDR2

FIG. 4: The model explains the data. Observed proﬁles were measured across the entire dataset used for model inference.See Fig. S6 for an equivalent ﬁgure when data was divided into training and testing sets. Reproducibility of parameters forindividual-speciﬁc models: context (A) and position (B) mutabilities. C. Number of mutations used for inference D. An examplemutation proﬁle in the most common V gene. E. Model performance across V genes. F. Mutation proﬁle in the FWR4 region.G. An example mutation proﬁle in the CDR3 region for CDR3 length of 50 nts. H. Model performance across CDR3 lengths.I. Comparison with the S5F model. J. Summary of models performance across sequence regions. We compare the full context-and position-dependent model ( γ w β x ) with purely context- ( γ w ) and position- ( β x ) dependent models as well as with the S5Fdefault model. for each V segment. These rates result from the com-bined eﬀect of position and context, but they are notﬁtted directly by the model. A typically good exam-ple of such a proﬁle is shown in Fig. 4D. The predictionis generally excellent (Pearson’s r ∼ − γ w with noattempt to disentangle position dependence, so a directcomparison is subject to caution. Besides, S5F mutabil-ities are learned from synonymous mutations of produc-tive sequences, requiring extrapolation methods to coverall 1024 contexts, all of which do not occur with syn-onymous mutations. Yet, the two sets of mutabilities γ w correlate fairly well ( r = 36%, Fig. 4I). Correlationrises to r = 44% for contexts appearing in synonymousmutations, versus r = 18% for the other contexts forwhich S5F recourses to extrapolation, emphasizing thelimitations of that extrapolation.A summary of the performances of the diﬀerent model-ing approaches on the mutabilities in the diﬀerent regionsof the IgH gene is shown in Fig. 4J. We also checkedfor overﬁtting by dividing the dataset into a trainingand a testing, ﬁnding similar results (Fig. S6). The fullposition and context dependent model ( µ s,x = γ w β x )performs better than models with context or positionalone ( µ s,x = γ w and µ s,x = β x ). While the contextexplains the bulk of the mutation proﬁle, adding po-sitional eﬀects substantially boosts performance. Ourmodel clearly outperforms the S5F model, although itshould be reminded that S5F was trained on a distinctdataset. Re-training S5F on the productive sequencesfrom the present datasets using the procedure describedin the original article [16] actually yielded worse perfor-mance (data not shown), for reasons that are unclear tous. Overall, accounting for phylogeny and disentanglingthe combined eﬀects of context and position allows ourmodel to accurately predict mutabilities including in thehyper-variable CDR3 region. Co-localization of mutations cannot be explained bycontext and position bias

It was previously observed that hypermutations tendto cluster along genomic position in nonproductive se-quences [22]. However, the origin of this phenomenonand its dependence on confounding factors such as phy-logeny and heterogeneous hot spot concentration werenot fully characterized.Clustering of mutations can be directly observed by plotting the fraction n ( r ) of pairs of mutations at dis-tance r from each other as a function r (normalized by thetotal number of pairs at that distance, see Fig. S7), whichis also called a spatial correlation function in physics. Fo-cusing on lineages with at least 6 leaves, and iteratingthrough all branches with fewer than 10 mutations, weevaluated this correlation function for pairs of mutationsoccurring in the same branch of the phylogeny versus dis-tant branches, as schematized in Fig. 5A). We then com-pared this correlation function to our model predictions(Fig. S8). The enrichment of closeby mutations can bequantiﬁed by the correlation function f ( r ) = n ( r ) /n m ( r ),where n ( r ), the fraction of pairs of mutations distant by r in the same tree branch is normalized by the modelprediction n m ( r ) (Fig. 5B).Pairs of mutations in distinct branches are well ex-plained by the model, suggesting that they are indepen-dent of each other, in agreement with the biological pic-ture that they occur at diﬀerent rounds of aﬃnity mat-uration. The enrichment of closeby mutations in distantbranches can be entirely explained by the clustering ofhotspot regions. Interestingly, both context and positiondependencies of the mutability are needed to explain thedata (Fig. S8). In contrast, pairs of mutations insidebranches tend to occur closer to each other than pre-dicted by the model. The enrichment of closeby muta-tions is up to ﬁve-fold, pointing to an additional mecha-nism causing hypermutation clustering. We observe thatthis enrichment persists in the presence of selection, asveriﬁed by computing the correlation function f ( r ) inproductive lineages (Fig. 5C). Minimal model of co-localization

To explain the observed excess of co-localized muta-tions, we propose a simple phenomenological model. Tar-geted mutations, following the context and position de-pendent proﬁles described so far, cause additional nearby‘follow-up’ mutations due to error-prone DNA repair.Given a substitution at x drawn from the same distri-bution as before, each position x (cid:54) = x can subsequentlymutate with probability p ( x | x ) = (cid:15) e −| x − x | /ξ , (3)where ξ is the correlation length and (cid:15) is small. The totalnumber of follow-up mutations is approximately Poissondistributed with mean (cid:80) x p ( x | x ) (cid:39) (cid:15)/ (1 − e − /ξ ). Tosimulate this process, we followed the same procedureas described earlier for synthetic data, but allowing forfollow-up, as well as targeted mutations, while keepingthe total number of mutations in each branch constant.We then computed the correlation function f ( r ), andcompared it to true proﬁles (Fig. 5B). We obtain a goodagreement for ξ = 10 and (cid:15) = 5% corresponding to anaverage of ∼ r C o rr e l a t i o n f un c t i o n f ( r ) Inside branchesBetween branchesModelof co-localization 1 25Distance r C o rr e l a t i o n f un c t i o n f ( r ) Inside branchesOut-of-frameProductive 326651Productive 326713

A B C

ACGTAGTCTTTACTGTGACACGTAGTGTTTCCTGTAAC

Pre-MRCAWithin selected branchMutationsOutsideFirst targetFollow-up x ⇠ µ ( x ) = w ( x ) x AAACG3icbZA9SwMxGMdz9a3Wt6o4ORgsQruUuzooFKHg4ljBvkCvHLk0bUOT3JHk1HL0e7i4+BXcXRwUcRIc/CpOptcO2vqHwC//53lInr8fMqq0bX9ZqYXFpeWV9GpmbX1jcyu7vVNXQSQxqeGABbLpI0UYFaSmqWakGUqCuM9Iwx+cj+uNayIVDcSVHoakzVFP0C7FSBvLy5ZuPRu6ZVdR7pahy6O8MQrwDLo9xDny4pvEGLk+0eZmeORlc3bRTgTnwZlCrrL3UDh4/BZVL/vhdgIccSI0ZkiplmOHuh0jqSlmZJRxI0VChAeoR1oGBeJEteNktxE8Mk4HdgNpjtAwcX9PxIgrNeS+6eRI99VsbWz+V2tFunvajqkII00EnjzUjRjUARwHBTtUEqzZ0ADCkpq/QtxHEmFt4syYEJzZleehXio6x8XSpUmjCCZKg31wCPLAASegAi5AFdQABnfgCbyAV+veerberPdJa8qazuyCP7I+fwCgAqLt { x } ⇠ p ( x | x ) = ✏ e | x x | /⇠ AAACHHicbVBNSwMxEM36WevXqkcvwSIoaN3Vg4IIBS8eK9hWaGrJplMNZrNLkpUt2/6QevCvePGgiBcPgv/GtPWg1gcDj/dmmJkXxIJr43mfzsTk1PTMbG4uP7+wuLTsrqxWdZQoBhUWiUhdBlSD4BIqhhsBl7ECGgYCasHt6cCv3YHSPJIXphNDI6TXkrc5o8ZKTfeAZCnpkWOieUiO4620mza9bXyCCcSai0iSHQxX2W433bVGd4+kvNd0C17RGwKPE/+bFEpL/QHuy033nbQiloQgDRNU67rvxaaRUWU4E9DLk0RDTNktvYa6pZKGoBvZ8Lke3rRKC7cjZUsaPFR/TmQ01LoTBrYzpOZG//UG4n9ePTHto0bGZZwYkGy0qJ0IbCI8SAq3uAJmRMcSyhS3t2J2QxVlxuaZtyH4f18eJ9X9on9Q3D+3aRTRCDm0jjbQFvLRISqhM1RGFcRQHz2iZ/TiPDhPzqvzNmqdcL5n1tAvOB9fcuekpA== FIG. 5: Co-localization of subsequent hypermutations. A. Co-localization model explained on an example phylogeny. Aninitial mutation is drawn from the context- and position-dependent model. Then, follow-up mutations are drawn in its vicinitywithin the same branch. B. Correlation function f ( r ) for pairs of mutations: inside (red) and between (green) branches andfor simulated mutations according to the co-localization model (3) with (cid:15) = 5% and ξ = 10. C. Co-localization of subsequenthypermutations in productive lineages from the 2 largest individuals’ datasets (ID=326651 and ID=326713). Correlationfunction f ( r ) for pairs of mutations inside branches compared to the out-of-frame result. The ﬁrst multiplicities of the codonframe length, r = 3 , , ,

12 are marked with dotted lines to guide the eye. Shaded areas represent 95% conﬁdence intervals.

We asked whether this large number of non-targettedmutations may bias the inference of the targeting model,which assumes no follow-up mutations. To assess thiseﬀect, we re-inferred the rates { β x } and { γ w } from syn-thetic datasets simulated with ξ = 10, (cid:15) = 5%, withdata-inferred context proﬁle γ w , and with data-inferredor ﬂat position proﬁles β x (Fig. S9). We ﬁnd that there-inferred mutabilities mostly agree with the true ones,with a slight shrinkage of values and enhanced muta-bilities of cold spots, owing to the equalizing eﬀect co-localization. Importantly, co-localization does not in-troduce additional features in the re-inferred position-dependent proﬁle, indicating that our inference proce-dure is robust to co-localization eﬀects. III. DISCUSSION

The mutational landscape of antibody repertoires re-sults from many entangled eﬀects, which are oftenlumped together into eﬀective models of hypermutations[12, 16, 31]. First, hypermutations have intrinsic prefer-ences for certain positions along the IgH gene, regardlessof their impact on protein function. In addition, selectionfor antibody function, which includes protein stabilityand antigen aﬃnity, favors beneﬁcial mutations and sup-presses deleterious ones [13]. While intrinsic SHM pref-erences are believed to be universal, selective forces varyacross lineages which are involved in distinct immune re- sponses [21], and may also depend on the individual’simmune status [32]. Repertoire sequencing gives a snap-shot of a rapidly adapting population subjected to theseforces, making it hard to disentangle intrinsic SHM pref-erences from the combined eﬀects of selection and geneticdrift. By focusing on non-productive lineages and using aphylogeny-based approach, we overcome the biases aris-ing from the dynamics of aﬃnity maturation to obtain acomprehensive picture of SHM intrinsic preferences.Each hypermutation occurs through a series of eventsof DNA damage and repair. The action of each enzyme,including AID to error-prone DNA repair enzymes, mayeach have their own sequence preferences, and the in-terplay of these diﬀerent biases results in the observedproﬁle. In our approach, these complex mutational path-ways are subsumed into an eﬀective model with a limitednumber of interpretable parameters in terms of eﬀectivecontext and position dependence. As a result, the contextdependent weights γ w do not simply reﬂect the bindingpreference of AID, but also account for the biases of otherbiochemical steps. Our framework enables direct mea-surement of the mutability γ w of a wide range of 5-mercontexts, recovering the known classiﬁcations of hot andcold spots [16, 28]. We show that our model outperformsexisting methods as well as purely context or positiondependent models in terms of explaining the data.The introduction of an explicit and universal positiondependence, β x , allows us to unveil an excess of muta-tions in the CDR1 and CDR2 regions. This enrichmentof mutations cannot be simply explained by their harbor-ing more hotspot contexts. We cannot exclude that thisresidual position dependence is due to more complex con-text eﬀects missed by our model (based e.g. on 7-mers,which would be impractical to infer from the presentdataset). Alternatively, SHM may preferentially targetthese regions independently of their sequence context,possibly through epigenetic mechanisms. Such prefer-ence is known to exist at the genome-wide level to mu-tate the Ig loci without aﬀecting other genes [13], so itis plausible that the same mechanism targets some spe-ciﬁc positions within Ig. The enrichment of mutationsin the CDR1 and CDR2 regions is even more markedin productive sequences, meaning that these mutationsare more likely to be selected during aﬃnity maturation.This suggests that the intrinsically enhanced mutabilityof these regions may carry an evolutionary advantage, byfocusing hypermutations on regions where they are themost beneﬁcial [30]. The stability of the immunoglobulinrelies on the FWR regions and most of the substitutionsare expected to be deleterious. The purifying nature ofselection in FWR regions has been quantiﬁed in Ref. [33]and contrasted with positive selection in CDR regions.By studying mutations along lineages, we were able tostudy mutations in the probable context in which theyoccurred, rather than relative to the germline sequence,allowing us to take into account the order of mutationsand to sample a broader diversity of 5-mer contexts. Thisapproach also allowed us to study and characterize hy-permutations in the CDR3, which has been neglectedin previous work [11] owing to the diﬃculty to separatethese mutations from junctional diversity.The phylogenetic methods employed in this study werenot speciﬁcally designed to study B cell repertoires. Inparticular the assumptions allowing for fast likelihoodcomputations do not account for the context dependenceof the mutation rate beyond the codon frame [23]. Theposition-dependent model introduced here could oﬀer acompromise. While it does not account for the the fullcomplexity of SHM biases, it captures the variation of themutation rate observed in out-of-frame data well (Fig. 4),and can operate under the assumption of independentsite evolution. Our framework could also be easily ex-tended to include position-dependent selection in the nu-cleotide or amino acid representation.Our analysis conﬁrmed a phenomenon of co-localization of mutations along the sequence. While thiseﬀect had been previously reported [22], here we showedthat it could not be explained by phylogenetic bias or theexistence of regions of higher and lower mutabilities. Weproposed a minimal quantitative model of hypermuta-tion targeting, followed by error-prone DNA repair thatcauses follow-up mutations, which explains the data well.While ideally we would like to infer the position and con-text mutability proﬁles taking these follow-up mutationsinto account, the task is impractical because it wouldrequire to identify the origin of each mutation. We ex-pect that doing so would only renormalize the values of the context preferences. While the adaptive advantageof co-localized mutations is unclear, we ﬁnd the corre-lation function in productive lineages follows the unpro-ductive baseline with additional enrichment enhanced atmultiples of the codon length, 3, suggesting signaturesof selection (Fig. 5C). We speculate that nearby muta-tions occurring simultaneously could help cross barriersof positive sign epistasis, whereby two or more mutationsare deleterious by themselves, but beneﬁcial together.This phenomenon could accelerate aﬃnity maturation byfavoring compensatory or epistatic mutations at aminoacids that interact strongly within the antibody protein[34, 35].The obtained mutability models make predictionsabout the likelihood and plausibility of particular tra-jectories of aﬃnity maturation. They could be useful indesigning vaccination strategy, by helping choose targetswith a greater potential for accumulating beneﬁcial mu-tations towards antibodies with desired properties suchas neutralization power, or broadness in the case of fastevolving pathogens such as inﬂuenza or HIV [11, 36]. IV. METHODSData and preprocessing

We perform the analysis on recently published high-throughput RNA sequencing of Ig heavy genes at greatdepth [7].The sequences were barcoded with unique molecularidentiﬁers (UMI) to correct for the PCR ampliﬁcationbias. However, UMI cannot be used to correct sequenc-ing errors, as most UMI were represented by a singlesequence: the number of UMIs used is of the same or-der as the total number of cells in use. We aligned rawsequences using presto of the Immcantation pipeline [37]with setup allowing to correct for errors in UMIs and dealwith insuﬃcient UMI diversity. The V region primerswere masked and the C region primers were used to dis-tinguish the two isotypes of sampled B cells: the IgMand IgG classes. The study of mutation proﬁles in thetwo groups revealed a much lower mutational load in theIgM cohort and hence a higher relative level of sequencingerrors, as well as shallower tree topologies. For furtheranalysis we chose to focus exclusively on the IgG class.Reads were ﬁltered for quality and paired using defaultpresto parameters. Pre-processed data was then alignedto V, D and J templates from IMGT [38] database usingIgBlast [39]. In total there were 3 . × IgG sequencesper person (average 3 . × , median 1 . × ), of whichup to 2% were unproductive (average 5 . × , median2 . × ).0 Inference of evolutionary trajectories

Sequences with a frameshift in the CDR3 region werethen selected and used to reconstruct clonal families asfollows. In the ﬁrst step, reads were aligned to the Vand J templates and grouped into classes of sequenceswith the same V and J gene assigned, as well as equalCDR3 length. In the out-of-frame classes we inferredclonal lineages by single linkage clustering with a thresh-old of 90% on CDR3 region identity [40]. We reconstructmaximum likelihood topologies, as well as the identity ofancestral states, under a simple K80 model of characterevolution [41] for all lineages comprising at least 6 uniquesequences. The model does not capture the complexity ofthe observed mutation proﬁle, but avoids ﬁtting multipleparameters independently in small lineages of relativelyshort alignment. The existing repertoire-wide method[23] is incompatible with out-of-frame lineages since itoperates on 61 productive codons. Ancestral states arefound through marginal reconstruction. Germline V andJ sequences were used as an outgroup to inform the phy-logenetic inference and root the lineage.

Model inference

With the exception of the initial branch, which joinsthe germline sequence and the most recent common an-cestor of the lineage, all branches shorter than 10 substi-tutions were used for model inference.Our task is to ﬁnd a set of parameters { γ w } , { β x } thatmaximise the log-likelihood L = (cid:88) S,T ;( i,j ) ∈ T  (cid:88) x (cid:48) | s i ( x (cid:48) ) (cid:54) = s j ( x (cid:48) ) ln (cid:16) e γ w ( si,x (cid:48) ) β x (cid:48) t − (cid:17) − (cid:88) x γ w ( s i ,x ) β x t (cid:35) , (4)where S is the set of sequences (observed and recon-structed) at each node of the tree, and T encodes the re-constructed phylogenetic tree through its branches ( i, j ),with reconstructed ancestral states s i and s j . The rates µ s,x = β x γ w ( s,x ) are deﬁned so that the length of eachbranch t is expressed in terms of the expected numberof substitutions per basepair (total number of substitu-tions divided by the total alignment length). Imposing ∂ L /∂γ w = 0 yields an implicit expression for γ w as afunction of { β x } , but independent of { γ w (cid:48) } w (cid:48) (cid:54) = w , whichcan be solved by one-dimensional root ﬁnding. Likewise,setting ∂ L /∂β x = 0 gives an implicit expression for β x as a function of { γ w } . We can perform the followingiteration: γ n = arg max γ L ( γ, β n ) (5) β n +1 = arg max β L ( γ n , β ) , (6) which converges to the maximum of L with respect tothe joint { γ w , β x } .To estimate the uncertainty of inferred parameters wesample with replacement from the set of all branches tocreate 400 bootstrap copies. We report 95% conﬁdenceintervals. Substitution models

Not only the targeting rate, but also the identity ofthe substitution is known to depend on the identity ofneighboring bases [16]. In our formulation of the model,inference of the targeting rates does not require knowingthe substitution type, however we can easily extend theframework to include this dependence. The probabilityof mutating from w to w (cid:48) over a period t can be expressedas P ( w → w (cid:48) | t ) = 1 − e − γ w ω w → w (cid:48) β x t , (7)where (cid:80) w (cid:48) ω w → w (cid:48) = 1 and ω w → w (cid:48) (cid:54) = 0 if w (cid:48) is a result ofa substitution at the central position of w . This way weadd 2 × = 2048 parameters to the model. We can inferthe maximum likelihood estimates of ω w → w (cid:48) using thesame iterative scheme introduced in the previous section. Synthetic datasets

We created synthetic datasets using the S5F model ofmutability (downloaded from clip.med.yale.edu/shm )for γ w . We used a ﬂat proﬁle, β x = 1 as well as sinusoidalproﬁles ln β x = 2 sin( x/δ ) − β x = 2 cos( x/δ ) − δ = 50. For each branch ( i, j ), we compute themutability µ s i ,x as a function of x , and then introducemutations at n random positions picked without replace-ment according to µ x / (cid:80) x (cid:48) µ x (cid:48) , where n is the number ofmutations on the branch (ﬁxed by the lineage structuretaken from the real data). Data availibility

All the data analyzed in this paper has been previouslypublished and can be accessed from original publications.Code for producing the ﬁgures of this paper, as well as theinferred model parameters, are freely available at https://github.com/statbiophys/shmoof . Acknowledgements

The study was supported by the European ResearchCouncil COG 724208. The authors are grateful for thediscussions and suggestions from Thomas Dupic, QuentinMarcou and Victor Chard`es.1 [1] Hozumi N, Tonegawa S (1976) Evidence for somatic re-arrangement of immunoglobulin genes coding for vari-able and constant regions.

Proceedings of the NationalAcademy of Sciences

Sci Transl Med

Proceedingsof the National Academy of Sciences { IgH } RepertoiresRevealed by Deep Sequencing.

The Journal of Immunol-ogy

Philos Trans R Soc Lond, B,Biol Sci

Plos One

Nature

Philos Trans R Soc Lond, B, BiolSci

Immunity

Jour-nal of immunology (Baltimore, Md. : 1950)

Cell

Frontiers in Immunology

Trendsin Immunology pp 1–15.[14] Weinstein JA, Jiang N, White RA, Fisher DS, QuakeSR (2009) High-throughput sequencing of the zebraﬁshantibody repertoire.

Science

Nat Biotechnol

Frontiers in Immunology

The Journal of Immunology

Philos Trans R SocLond, B, Biol Sci

Frontiers in Im-munology

Ge-netics

PLOS Computational Biology

Na-ture Communications

Proceedings ofthe National Academy of Sciences of the United States ofAmerica

Journal ofMolecular Evolution

Bioinformatics

Pro-ceedings of the National Academy of Sciences

Fron-tiers in Immunology

Sci-ence

Nature

Molecular Immunology

Journal of Autoimmunity

MolecularBiology and Evolution

Journal of Biological Chemistry

Cell Systems pp 1–8. [36] Liao HX, et al. (2013) Co-evolution of a broadly neutral-izing HIV-1 antibody and founder virus. Nature

Bioinformatics

Nucleic Acids Research

Bioinformatics

Journal of MolecularEvolution Supplementary information True context mutabilities10 I n f e rr e d m u t a b ili t i e s S5F with = 1S5F with = x

100 200 300 400Position x I n f e rr e d m u t a b ili t i e s Recovered = 1True proﬁle 100 200 300 400Position x Recovered = x Recovered = x True proﬁles

A B C

CDR3CDR3

FIG. S1: Model inference on synthetic data using true phylogenies. A. Inference of context mutabilities γ . B,C. Inference ofposition mutabilities β for ﬂat and sinusoidal proﬁles. Error bars correspond to 95% conﬁdence intervals. Frequency at whicha given position belongs to the CDR3 region is indicated with the grey shaded areas. Mutabilities w M u t a b ili t i e s ¯ w A/T-centered 5-mers10 Mutabilities w M u t a b ili t i e s ¯ w C/G-centered 5-mers

A B

FIG. S2: Strand asymmetry of the context-dependent rates. For each motif w we juxtapose its mutability γ w with the mutabilityof its reverse complement γ ¯ w . A. 5-mer motifs with strong central nucleotide (C/G), r = 54%. B. 5-mer motifs with weakcentral nucleotide (A/T), r = 7%

100 200 300 400Distance x from V5’ end0 . . . . . . . M u t a b ili t i e s

300 200 100 0Distance x r from J3’ end0 . . . . . . . . A B

CDR1 CDR2 CDR3 CDR1 CDR2 CDR3

FIG. S3: Alternative position deﬁnition. We compare the models based on diﬀerent position deﬁnitions: x , the distance fromthe 5 (cid:48) end of the V gene (A, red) and x r , distance from the 3 (cid:48) end of the J gene (B, green). Error bars correspond to 95%conﬁdence intervals. Frequency at which a given position belongs to a CDR region is indicated with the grey shaded areas. Mutabilities ( = 1)10 M u t a b ili t i e s ( = x ) = 1 versus = x FIG. S4: Introducing explicit position dependence β (cid:54) = 1 inﬂuences the context-dependent rates. We compare the γ mutabilitiesfrom the full model with the parameters of the purely context-dependent model. w sorted by M u t a b ili t i e s x sorted by M u t a b ili t i e s A B

FIG. S5: Analysis of the uncertainty of estimate parameters γ (A) and β (B). The shaded area indicates the 95% conﬁdenceinterval envelope. Motifs and positions are sorted by their respective mutabilities

100 150 200 250 300Position in the V gene, x IMGT . . . . . . F r a c t i o n o f m u t a t i o n s IGHV3-23ObservedExpected -

23 3 - - - - - - - -

74 1 - - - - - - -

69 6 - - -

53 3 - IGHV genes (20 most frequent)0 . . . . . . . . . P e a r s o n r . . . . . . . . . . F r a c t i o n o f m u t a t i o n s ObservedExpected Cys104 Try118Position in CDR30 . . . . . . F r a c t i o n o f m u t a t i o n s CDR3 length of 50 ntObservedExpected 20 40 60 80 100CDR3 length0 . . . . . . . P e a r s o n r Mutabilities w S F m u t a b ili t i e s CDR1 FWR2 CDR2 FWR3 CDR3 FWR40 . . . . . . P e a r s o n r Models w x w x S5F

A BC D EGF

CDR1 CDR2

FIG. S6: Model performance when model was trained on 2/3 of data, and tested on the remaining 1/3. A. Example mutationproﬁle in the most common V gene. B. Model performance across V genes. C. Mutation proﬁle in the FWR4 region. D.Example mutation proﬁle in the CDR3 region for CDR3 length of 50 nts. E. Model performance across CDR3 lengths. F.Comparison with the S5F model. G. Summary of models performance across sequence regions. r F r a c t i o n o f p a i r s Inside branchesBetween branchesModel w x FIG. S7: Fraction of pairs n ( r ) of mutations encoded in the same tree branches (red) and pairs of mutations between diﬀerentbranches (green) normalized by the total number of pairs at that distance, ( l − r ) / (cid:0) l (cid:1) , where l stands for the alignment length.Shaded area corresponds to 95% conﬁdence interval. r F r a c t i o n o f p a i r s Between branchesModel w x Model x Model w FIG. S8: Normalized fraction n ( r ) of pairs of mutations between diﬀerent branches (green). Context- and position-dependentmodels prediction n m ( r ) for this quantity are presented for the full ( µ s,x = γ w β x ), purely context-dependent ( µ s,x = γ w ) andpurely position-dependent ( µ s,x = β x ) models. Shaded area corresponds to 95% conﬁdence interval. True context mutabilities10 I n f e rr e d m u t a b ili t i e s with = 1 0 100 200 300 400Position x I n f e rr e d m u t a b ili t i e s Recovered = 1True proﬁle10 True context mutabilities10 I n f e rr e d m u t a b ili t i e s with = ( x ) 100 200 300 400Position x I n f e rr e d m u t a b ili t i e s Recovered = ( x )True proﬁle A BC D

FIG. S9: Model inference on synthetic data with correlated mutations using true phylogenies. Mutations were drawn accordingto the co-localization model (3) with (cid:15) = 5% and ξ = 10 with data-derived context and position dependent mutabilities. A,C.Inference of context mutabilities γ . B,D. Inference of position mutabilities ββ