Heterologous autoimmunity and prokaryotic immune defense
AA scaling law in CRISPR repertoire sizes arises from avoidance of autoimmunity
Hanrong Chen, ∗ Andreas Mayer, † and Vijay Balasubramanian
1, 3 David Rittenhouse Laboratory, Department of Physics and Astronomy, University of Pennsylvania, USA Lewis-Sigler Institute for Integrative Genomics, Princeton University, USA Theoretische Natuurkunde, Vrije Universiteit Brussel, Belgium (Dated: January 6, 2021)Some bacteria and archaea possess an adaptive immune system that maintains a memory of pastviral infections as DNA elements called spacers, stored in the CRISPR loci of their genomes. Thismemory is used to mount targeted responses against threats. However, cross-reactivity of CRISPRtargeting mechanisms suggests that incorporation of foreign spacers can also lead to autoimmunity.We show that balancing antiviral defense against autoimmunity predicts a scaling law relating spacerlength and CRISPR repertoire size. By analyzing a database of microbial CRISPR-Cas systems,we find that the predicted scaling law is realized empirically across prokaryotes, and arises throughthe proportionate use of different CRISPR types by species differing in the size of immune memory.In contrast, strains with nonfunctional CRISPR loci do not show this scaling. We also demonstratethat simple population-level selection mechanisms can generate the scaling, along with observedvariations between strains of a given species.
Clustered regularly interspaced short palindromic re-peats (CRISPR) and CRISPR-associated (Cas) proteinsform a prokaryotic defense against phage [1]. CRISPRloci are composed of DNA repeats alternating with vari-able DNA segments called spacers, acquired from phageand other foreign genetic material. In a process called in-terference, spacer RNA guides sequence-specific bindingand cleavage of target DNA by Cas proteins. In this way,spacers acquired during phage attack confer acquired,heritable resistance against subsequent invasions.CRISPR-Cas systems are remarkably diverse, char-acterized by functionally divergent Cas proteins anddistinct mechanisms for each stage of immune defense[2]. Spacer acquisition is mediated by the conservedCas1–Cas2 adaptation module, which sets spacer lengthswithin a narrow range varying by system [3, 4]. CRISPRarrays are also broadly distributed in size, ranging fromless than 10 to hundreds of spacers, and the full reper-toire of a host may comprise several CRISPR arrays [5].Maintaining a broad spacer repertoire confers resistanceagainst many phages and possible escape mutants [6].However, there are constitutive costs associated with Casprotein expression [7], and diminishing returns of broaddefense due to finite Cas protein copy numbers [8, 9]. Inaddition, CRISPR-Cas systems can prevent horizontaltransfer of beneficial mobile genetic elements [10, 11].CRISPR-Cas systems also cause autoimmunity, occur-ring when a spacer guides interference somewhere on thehost genome, leading to cell death and strong mutationalpressure in the CRISPR-cas locus and target region [12–15]. The patchy incidence of CRISPR-Cas systems inprokaryotes (roughly 40% of bacteria and 85% of archaea[2]), and the presence of diverse mechanisms for self– ∗ These two authors contributed equally to this work.; Presentaddress: Computational & Systems Biology, Genome Instituteof Singapore, Singapore † These two authors contributed equally to this work. nonself discrimination [16], suggest that avoiding autoim-munity is a constraint in the evolution of CRISPR-Cassystems [2, 13–19].Several mechanisms exist in divergent CRISPR-Castypes for suppressing autoimmunity arising from differ-ent forms of potential self-targeting [16]. In type I and IIsystems, interference requires presence of a protospacer-adjacent motif (PAM), a 2–5-nt-long sequence adjacentto target DNA but absent in CRISPR repeats, preventinginterference within the CRISPR array [20, 21]. In typeIII systems, interference requires transcription of targetDNA, which avoids targeting phages integrated into thehost chromosome (prophage) [22]. Spacers acquired fromthe host genome are naturally self-targeting, but thereare mechanisms to suppress such acquisition [23, 24]. Forexample, type I-E systems acquire spacers preferentiallyat double-stranded DNA breaks, which occur primarilyat stalled replication forks of replicating phage DNA, andacquisition is confined by Chi sites which are enriched inbacterial genomes [23].Here we propose that CRISPR evolution is also shapedby heterologous autoimmunity , which occurs if an ac-quired foreign spacer and a segment of the host genomeare sufficiently similar. The likelihood of this effectdepends on sequence statistics and the specificity ofCRISPR targeting mechanisms. Heterologous autoim-munity is analogous to off-target effects that are an im-portant concern in CRISPR-Cas genome editing [25, 26],but the possible effects on prokaryotic adaptive immunityhave not been explored. We combine a probabilistic mod-eling approach with comparative analyses of CRISPRrepertoires across prokaryotes to show that: (a) het-erologous autoimmunity is a significant threat caused byCRISPR-Cas immune defense, (b) avoidance of autoim-munity leads to a scaling law in CRISPR repertoires,and (c) the scaling law can be achieved by population-level selection. Our work suggests that avoidance of het-erologous autoimmunity is a key factor shaping CRISPRrepertoires and the evolution of CRISPR-Cas systems. a r X i v : . [ q - b i o . P E ] J a n I. RESULTSA. Cross-reactivity leads to autoimmunity
We approach heterologous self-targeting as a sequence-matching problem [27–29], and derive estimates for theprobability of a spacer being sufficiently similar to at leastone site in the host genome. For a spacer of length l s and PAM of length l p (where it exists), an exact matchat a given position requires l ≡ l s + l p complementarynucleotides. In a host genome of length L , where L (cid:29) l ,there are L − l + 1 ≈ L starting positions for a match.At leading order, and ignoring nucleotide usage biases,we may treat matches as occurring independently withprobability 4 − l . Thus, the probability of an exact matchanywhere on the genome is (see Methods) p ≡ L − l , where l = l s + l p . (1)Considering order-of-magnitude parameter estimates forthe E. coli type I-E system of L = 5 × nt, l s = 32 nt,and l p = 3 nt gives a negligible probability p ∼ − .However, CRISPR interference tolerates several mis-matches between spacer RNA and target DNA depend-ing on position and identity [25, 26, 30, 31]. In general,mismatches in the PAM are not allowed, and mismatchesin the PAM-distal region are tolerated to a greater extentthan mismatches in the PAM-proximal region [21]. Up to ∼ k fix fixed positions, and (b) mismatches at k var variablepositions anywhere else in the target region. These in-crease the per-spacer self-targeting probability by a com-binatorial factor α ( k fix , k var , l ) (see Methods), so that p self = α ( k fix , k var , l ) p = α ( k fix , k var , l ) L − l . (2)A greater number of allowed mismatches greatly in-creases the likelihood of heterologous self-targeting (Fig.1b). To gain intuition we can rewrite Eq. 2 as p self ≡ L − l eff , where (3) l eff ( l, k fix , k var ) ≡ l − log α ≈ l − k fix − k var log l − k fix ) , where l eff is the effective spacer length after discount-ing for allowed mismatches (see Methods). This shows that mismatches exponentially increase the probabilityof self-targeting, and variable-position mismatches par-ticularly so. Considering the E. coli system as before,the matching probability increases to p self ∼ − with k fix = 5 nt and k var = 5 nt (see Fig. 1b). Other CRISPR-Cas systems may similarly lie in parameter regimes withappreciable p self , especially when including indirect self-targeting through primed acquisition [34]. Furthermore,the probability of self-targeting is likely higher than im-plied by our calculations as it can be increased by cor-relations in sequence statistics between host and phagegenomes [28, 29]. Given our estimates, we thus hypothe-size that heterologous autoimmunity may occur generallyand be a significant cost of CRISPR-Cas immunity. B. Spacer length scales with repertoire size
To test this hypothesis, we exploited the large naturalvariability in CRISPR systems across different microbialspecies. As the self-targeting probability depends expo-nentially on spacer length (Eqs. 2–3), we expect smalldifferences in length to lead to large variations in the riskof autoimmunity. If CRISPR repertoire sizes are selectedto balance broader immunity against the risk of autoim-munity, then qualitatively we expect that species withshorter spacers should have smaller repertoires, whilespecies with longer spacers should have larger ones (Fig.1c).To make this prediction more quantitative, supposeprokaryotes tolerate a maximum probability P of self-targeting, and that CRISPR-Cas systems are selected tomaximize protection against pathogens subject to thisconstraint. Repertoires with N spacers incur a self-targeting probability of ∼ N p self , and thus Eq. 2 impliesln N = l ln 4 − ln α ( k fix , k var , l ) − ln( L/P ) . (4)Linearizing the dependence of the combinatorial factor α around typical spacer lengths l (see Methods) predictsa scaling relationship between spacer length and the log-arithm of repertoire sizeln N ∼ l (cid:18) ln 4 − k var l − k fix (cid:19) ∼ . l, (5)where we arrived at the latter estimate by taking k var ∼ k fix ∼ l ∼ Figure 1.
CRISPR-Cas immune defense incurs a risk of heterologous autoimmunity. a , Sketch of the maincomponents of CRISPR-Cas immune defense. b , Per-spacer probability of heterologous self-targeting, p self , as a function of thenumber of tolerated mismatches at fixed and variable positions along the spacer, k fix and k var , respectively. c , We hypothesizethat the evolution of CRISPR-Cas systems is constrained by the risk of heterologous autoimmunity. As the self-targetingprobability depends strongly on spacer length, this predicts a scaling of repertoire size with spacer length. spacer repertoire sizes, defined as the sum of CRISPRarray sizes in each genome, was broad, ranging from 1 to812 spacers (Fig. 2b).A linear regression between spacer length and log-repertoire size gave a slope of 1 . ± . k var = 3 . ± . k fix = l s / Figure 2.
A scaling law relates CRISPR repertoire size and spacer length. a, b
Distributions of spacer lengths (a)and repertoire sizes (b) across prokaryotes. For each of 2,449 species with CRISPR and cas loci we randomly picked a singlestrain (see Methods), and calculated its repertoire size as the sum of all CRISPR array sizes present in the genome. The lengthdistribution of all spacers found in these filtered strains are shown in (a). Bins in (b) were formed by dividing each decade into10 equal bins on a log scale. c , Scaling of repertoire size with spacer length. A linear fit of the mean spacer length againstlog repertoire size was performed on all 2,449 species, and is shown alongside the data, which is binned by repertoire size(50 strains/bin). The fitted slope is consistent with theory predictions (Eq. 5). d , Fraction of species with missing cas genesdecreases with spacer length. e , Spacer length and repertoire size do not show a clear relation in strains with nonfunctionalCRISPR loci. A linear fit was performed on 340 species with CRISPR arrays but no cas loci, and is shown alongside the data,which is binned by repertoire size (50 strains/bin). Error bars in panels c–e denote the standard error of the mean in each bin,which in (d) are calculated assuming a binomial probability distribution for the absence of cas at each spacer length. C. Variable CRISPR-Cas type use underlies scaling
CRISPR-Cas systems are classified into different typesand subtypes based on their evolutionary relationshipsand the use of different cas genes [2]. We won-dered whether the aggregate scaling relationship betweenspacer length and repertoire size (Fig. 2) reflected dif-ferences at the level of CRISPR-Cas type usage. Wethus grouped the species by subtype, when there is a sin-gle CRISPR-Cas system in the genome, or in a separategroup when multiple subtypes are present.For species carrying a single cas type, we aggregatedall spacers found across species of each type to quantifythe statistics of acquired spacer lengths. We observed dif-ferences in the spacer length distributions between types(Fig. 3a): (a) Type II-A and II-C systems have narrowdistributions tightly clustered around 30 nt; (b) TypeI-E and I-F systems also have narrow distributions, clus-tered around 32 nt, while other type I systems have spac-ers that are longer and more broadly distributed; (c)Type III systems have even longer and more broadly dis- tributed spacers, with median lengths in the 36–39 ntrange.A broader distribution of acquired spacer lengths leadsto a higher risk of autoimmunity than a narrow distribu-tion with the same mean, since the self-targeting prob-ability increases exponentially for shorter spacers. Toaccount for an increase in autoimmune risk for broaderdistributions, we focused on the lower quartile of spacerlengths for each cas type as a proxy for autoimmune risk.Also, to account for the requirement of PAM recognitionin type I and II (but not type III) systems, we added aPAM length of 3 nt to types I and II to obtain the overalllength l . Strikingly, we observed that the predicted rela-tionship between l and repertoire size also broadly holdsbetween CRISPR-Cas types (Fig. 3b): Type II systemshave the shortest spacers and the smallest repertoires,and among type I subtypes those with shorter spacersgenerally have smaller repertoires. Type III systems havesmaller repertoires than type I systems despite somewhatlonger spacers, but this is explained by the absence ofPAMs and the broader spacer length distributions for Figure 3.
Preferential CRISPR-Cas type use underlies the scaling law relating spacer length and repertoire size.a , Length distributions of all spacers found in single-type strains aggregated by CRISPR-Cas type (for types with >
10 speciesin CRISPRCasdb [35]; see Methods). Also indicated are the median (solid vertical) and lower quartile (dotted vertical line) foreach distribution. The subtypes are presented in order of lower quartile. b , A trend is observed between spacer + PAM lengthand repertoire size for different CRISPR-Cas types. For spacer lengths, the central dot is the lower quartile and the whiskerruns between the lowest decile and the median. Repertoire sizes are indicated as the mean ± standard error. To indicate therequirement of PAM recognition, a length of 3 nt was added to all type I and II (but not type III) subtypes. c , Variable usageof cas subtypes among multiple-type strains. A total of 826 strains with multiple CRISPR-Cas systems, randomly picked fromdifferent species, were analyzed. They were divided into 3 groups of 275, 275 and 276 strains having small, medium, and largerepertoire size, respectively. Each subfigure was normalized to 1, so that the bars indicate the relative incidence of a subtypein each repertoire size bin. The order of subtypes is the same as panel a. the type III systems, both of which increase autoimmunerisk.We next tested whether this relation also carries overto species carrying multiple CRISPR-Cas systems, in theform of a differential use of cas types as a function ofrepertoire size. We divided species with multiple castypes into three equally sized groups by repertoire size,and determined the relative incidence of CRISPR sub-types within each group (Fig. 3c). We found that theuse of types II, I-E, and I-F decreases with repertoire sizein line with expectations, and an opposite pattern for twoof the type III systems and the type I systems with thelongest spacers. The relation between total repertoiresize and spacer length in species with multiple cas sub-types was further reinforced by a direct analysis of theincidence of spacers of different lengths as a function ofrepertoire size, with a greater proportion of longer spac-ers present in larger repertoires (Fig. S2).Taken together, we find that species carrying eithersingle or multiple CRISPR-Cas systems differentially useCRISPR-Cas types having different spacer length distri-butions to form repertoires of different sizes. This differ-ential use gives rise to the aggregate scaling observed inFig. 2c, and is consistent with the hypothesis of minimiz-ing the risk of heterologous autoimmunity. D. Dynamical origin of the scaling law
Dynamical mechanisms can give rise to the scaling lawthat our theory predicts, and which is found in the em-pirical data. While spacer dynamics involves complexepidemiological feedbacks [8, 37–44], here we consider asimple effective model in which spacer acquisition andloss are described as a birth-death process, such thatspacers are acquired at a rate b and lost at a per-spacerrate d (Fig. 4a, left panel). This gives rise to a Pois-son distribution of repertoire sizes at steady state, withmean b/d (see SI). Our statistical theory requires that themean of the distribution should shift with spacer length.There are two mechanisms by which selection could leadto such a dependence. First, the negative fitness effect ofacquiring self-targeting spacers [13] purges lineages thatundergo deleterious acquisition events. Indeed, CRISPRarrays are selected for the absence of self-targeting spac-ers [12]. Effectively, this reduces the net acquisition rateamong surviving lineages. Second, over longer evolution-ary timescales, different CRISPR-Cas systems may beselected to acquire spacers at different rates dependingon their respective risks of autoimmunity. These differ-ences in rates could arise from the maintenance of mul-tiple copies of cas genes, or through regulation of casexpression [45]. Indeed, spacer repertoire size increaseswith the number of cas loci (Fig. S3), suggesting thatlarger gene copy numbers of cas1 and cas2, necessary forspacer acquisition, result in greater acquisition rates. In-terestingly, strains having exactly one copy of both cas1 Figure 4.
A spacer acquisition-loss model with population-level selection of acquisition rates produces scalingas well as substantial variability between strains of the same species. a , In our model, strains acquire spacers at arate b and lose them with a per-spacer rate d , giving rise to a Poisson distribution at steady state with mean b/d (left panel). b is selected to minimize the risk of heterologous autoimmunity, such that species differing in p self have different mean repertoiresizes (middle panel). We generate a synthetic dataset of strains by sampling from steady-state distributions with different spacerlengths and hence p self (see Methods). The synthetic data displays scaling of the mean and variability on the single-strain level(right panel). The green points show 100 individually sampled strains, the blue points means after binning by repertoire size(50 species/bin, 2,450 species total), and the orange line is a fit to all 2,450 sampled points. b , Correlated spacer loss broadensthe predicted distribution of repertoire sizes. We consider a model in which all spacers are lost simultaneously during a deletionevent, which leads to a geometric steady-state distribution (see SI). Despite this additional variability, a synthetic samplegenerated as in panel a shows scaling of the means. c , The distributions of repertoire sizes of sequenced strains belonging to thesame species are broad. 4 pairs of species with >
50 sequenced strains and with the indicated CRISPR-Cas type are displayed.Vertical lines denote the mean for each species. and cas2 still obey a scaling relationship (Fig. S4), sug-gesting that regulation of these genes also contributes tominimizing autoimmune risk.Let us suppose that one or both of these selectionmechanisms lead to an effective spacer acquisition rateinversely proportional to the risk of self-targeting, b ∝ /p self . To replicate the empirical analysis, we created asynthetic dataset of the same size by sampling strainsat random from steady-state distributions at differentspacer lengths, which have different p self (Fig. 4a, mid-dle panel) (see Methods). Plotting spacer length againstmean repertoire size in the same way as we did for theempirical data, we recover a scaling law as predicted byour theory (Fig. 4a, right panel).In addition to providing a dynamical explanation forscaling of the means, this birth-death model producessubstantial variability around the mean relationship (Fig.4a, right panel, green dots). In fact, the Poisson vari- ance of Fig. 4a, originating from a constant per-spacerloss rate, is likely an underestimate. Spacer loss occursthrough double-stranded DNA breaks followed by ho-mologous recombination at a different CRISPR repeat,a process which may delete chunks of an array in a singledeletion event (see e.g. [46, 47]). Including such cor-related spacer loss greatly increases the variance. Forexample, in a simple analytically solvable limiting casewhere entire arrays are lost at once, the distribution ofrepertoire sizes becomes geometric (see SI) and thus verybroad (Fig. 4b, left panel). Given this substantial vari-ability, we wondered whether a comparative analysis ofmany species could still recover a scaling in this model.We thus sampled strains from geometric steady-state dis-tributions whose means obey a scaling law as in panel a,observing a much larger variability in individual strains(Fig. 4b, right panel, green dots). However, the scalingof the mean is recovered with a fit to the dataset andwhen strains are binned by repertoire size (Fig. 4b, rightpanel).Prompted by the broad variability predicted by cor-related spacer loss, we analyzed the repertoire size dis-tributions of species with many sequenced strains (seeMethods). We indeed observed a broad distribution evenamong strains of the same species (Fig. 4c). We com-pared the repertoire size distributions of four pairs ofhighly-sampled species with the same CRISPR-Cas type,and found that the within-species variability comprises asubstantial part of the overall variance. Additional vari-ability between species, leading to different mean reper-toire sizes for species with the same CRISPR-Cas type,might originate from different microbes inhabiting envi-ronments that differ in viral diversity and thus pressureto acquire broad immune defense. We tested the robust-ness of the comparative analysis to this additional sourceof variability, by sampling strains from steady-state dis-tributions where we additionally sample the prefactor in b ∝ /p self from a wide distribution (see Methods). Thisfurther increases the variability of individually sampledstrains, but the means still show scaling (Fig. S5). II. DISCUSSION
An adaptive immune system is dangerous equipmentto have in an organism. There is always the risk that theimmune receptors, intended as defenses against foreigninvaders, will instead target the self. In CRISPR-Cassystems, biophysical mechanisms avoiding various formsof autoimmunity such as targeting of the CRISPR locusand self-spacer acquisition are known [16, 20, 21, 23], buthere we propose that heterologous autoimmunity, wherespacers acquired from foreign DNA seed self-targeting, isa significant threat to microbes carrying CRISPR-Cas.This threat is analogous to off-target effects in genome-editing applications [25, 26], and has been observed in anexperimental CRISPR-Cas system [33], but its wider im-plications for the evolution of CRISPR-Cas systems areunexplored. We showed that avoidance of this form of au-toimmunity while maximizing antiviral defense predictsa scaling law relating spacer length and CRISPR reper-toire size. The scaling depends on the number and natureof sequence mismatches permitted during CRISPR inter-ference and primed acquisition.To test our prediction we used a comparative approachanalyzing the natural variation in CRISPR-Cas systemsacross microbial species, and demonstrated that: (a) thepredicted scaling law is realized, (b) the observed scalingconstrains parameters for cross-reactive CRISPR target-ing to lie in a range consistent with experimental stud-ies, (c) the scaling arises in part from differential usageof different CRISPR-cas subtypes having different spacerlength distributions, and (d) the scaling, and hence a bal-anced tradeoff between successful defense and autoimmu-nity, can be achieved by population-level selection mech-anisms. In addition, we demonstrated a negative control: CRISPR arrays in species that no longer have functionalCas proteins, and thus are not at risk of autoimmunity,do not show the predicted scaling relation. We proposetwo further tests of the link between spacer length andautoimmune risk: (1) If cross-reactivity leads to self-targeting, in addition to a depletion of self-targeting spac-ers in CRISPR arrays [12, 36], we predict a depletion ofspacers several mismatches away from self-targets, and(2) Our theory predicts that CRISPR-Cas subtypes withlonger spacers should acquire spacers more readily.A similar tradeoff between sensitivity to pathogens andautoimmune risk shapes the evolution of vertebrate adap-tive immune systems [27, 48]. In the light of our results itwould be interesting to determine whether this tradeoffalso leads to a relation between the size of the immunerepertoire and specificity in vertebrates. Such a relationwill likely be harder to ascertain for vertebrates as pat-terns of cross-reactivity between lymphocyte receptorsand antigens are more complex. Interestingly, however,T cell receptor hypervariable regions in human are sev-eral nucleotides longer on average than those found inmice [49], which accompanies a substantial increase inrepertoire size in human. If longer hypervariable regionstranslate to a greater specificity on average, one mightview the increased human receptor length as an adapta-tion to a larger repertoire. The key to our current workwas the ability to compare microbial immune strategiesacross a large panel of phylogenetically distant species.Further insight into how this tradeoff shapes vertebrateimmune systems might thus be gained by building onrecent efforts to survey adaptive immune diversity in abroader range of vertebrates [50, 51].Many theoretical studies of adaptive immunity in bothprokaryotes [8, 37–44] and vertebrates [52–55] considerdetailed dynamical models of evolving immune reper-toires. For prokaryotes, such dynamical models can beregarded as describing the role of CRISPR-Cas as a short-term memory for defense against a co-evolving phage[56]. Studying adaptive immunity in this way requiresdetailed knowledge of the parameters controlling the dy-namics, many of which are not well-characterized experi-mentally. In this paper, we took an alternative approachof focusing on the statistical logic of adaptive immunity,where we regard the bacterial immune system as a func-tional mechanism for maintaining a long-term memory[56] of a diverse phage landscape [57], via probabilisticmatching of genomic sequences. Previous work takingthis perspective offered an explanation for why prokary-otic spacer repertoires lie in the range of a few dozen toa few hundred spacers [56]. As in our discussion of possi-ble mechanisms for generating the observed scaling law,evolution should select dynamics that achieve the statis-tical organization that we predict, because this is what isuseful for achieving a broad defense against phage whileavoiding autoimmunity. A probability theory perspec-tive of this kind has been applied to the logic of theadaptive immune repertoire of vertebrates [58–60], butto our knowledge we are presenting a novel approach tothe study of CRISPR-based autoimmunity.
III. MATERIALS AND METHODS a. Derivation of self-targeting probability.
We esti-mate the probability of an alignment between a spacer +PAM sequence of length l and a host genome of length L . We assume that both sequences are random and un-correlated, with nucleotide usage frequencies of 1/4. Ina length- L genome, where L (cid:29) l , there are L − l + 1 ≈ L starting positions for an alignment. The matching prob-ability at each position, p m , depends on the number andnature of mismatches tolerated. In regimes where p m issmall, the matching probabilities at the different posi-tions may be treated independently. Thus, the probabil-ity of having at least one alignment within the length- L genome is p self = 1 − (1 − p m ) L − l +1 ≈ Lp m , since p m (cid:28) , l (cid:28) L. (6)If no mismatches are tolerated, p m = 4 − l as in Eq. 1.At each site where a mismatch is allowed, four alterna-tive nucleotide choices are possible. This gives a certainnumber α of unique complementary sequences matchingto a given spacer, which we compute as a function ofthe number and nature of mismatches. If up to k fix mis-matches are tolerated at fixed positions in the alignment, α = 4 k fix . If instead up to k var mismatches are tol-erated anywhere in the complementary region, naively α ∼ (cid:0) lk var (cid:1) k var , where the binomial coefficient is thenumber of combinations of sites where mismatches areallowed. This is however an upper bound as matchingsequences are overcounted, and the precise expression is α = k var (cid:88) i =0 (cid:18) li (cid:19) i , (7)where each term in the sum is the number of unique com-plementary sequences having exactly i mismatches. Thelargest term dominates, giving α ≈ (cid:0) lk var (cid:1) k var . Thus,combining k fix mismatches at fixed positions and up to k var mismatches at any of the remaining l − k fix positionsgives α ( k fix , k var , l ) ≈ k fix (cid:18) l − k fix k var (cid:19) k var . (8)We can introduce an effective spacer length, l eff , by p m ≡ − l eff . To leading order the binomial expression in Eq. 8is approximated by ( l − k fix ) k var . This gives l eff ≈ l − k fix − k var log l − k fix ) as in Eq. 3. The probability that a repertoire of N spacers avoidsself-targeting, 1 − P self , is one minus the probability thatat least one spacer self-targets. This gives P self = 1 − (1 − p self ) N ≈ N p self , since Lp m (cid:28) . (9)If CRISPR repertoires are selected to maximize reper-toire size subject to the constraint P self ≤ P , we obtainEq. 4. Taylor expanding ln N around l = l gives Eq. 5to lowest order in l . b. Comparative analyses. For our comparative anal-yses we use CRISPRCasdb [5], which is a database ofCRISPR and cas loci identified using CRISPRCasFinder[61] in public bacterial and archaeal whole-genome as-semblies [35]. CRISPR arrays are assigned evidence lev-els 1–4, 4 being the highest confidence [61]. We restrictedour analysis to level 4 CRISPR arrays only. Strains con-taining both annotated CRISPR and cas loci were usedfor the analyses in Figs. 2a–c, 3, and 4c. Strains contain-ing annotated CRISPR but no cas loci were used for theanalyses in Figs. 2d–e. In order to eliminate oversam-pling of certain species, we picked one strain at randomfrom each species for further analysis (2,449 species withannotated CRISPR and cas loci, and 340 species withannotated CRISPR but no cas loci). To produce Fig. 3,the randomly chosen strains were grouped by annotatedcas subtype, or into a separate group if they contain mul-tiple subtypes. The 12 subtypes shown in Figs. 3a and chave >
10 species represented in CRISPRCasdb. To pro-duce Fig. 4c, 4 pairs of species, each with >
50 sequencedstrains of the same CRISPR-Cas type, were chosen foranalysis. c. Synthetic data generation and analysis.
A syn-thetic dataset producing a scaling law was generatedin the following way: (1) A spacer of length l s wasdrawn from the length distribution of Fig. 2a, and (2)a repertoire size distribution with mean A/p self was cre-ated, from which one strain was sampled and added tothe dataset. Parameter values of L = 5 · , l p = 3, k fix = l s / k var = 3, and A = 10 − . were used. Thesteady-state distributions are Poisson in Fig. 4a, and ge-ometric with the same mean in Fig. 4b. In Fig. S5, A wassampled from a log-normal distribution with the samemean, and standard deviation chosen such that the coef-ficient of variation is 1.2. Acknowledgements.
We thank Serena Bradde forhelpful comments on this paper. VB and HC were sup-ported in part by a Simons Foundation grant in Mathe-matical Modeling for Living Systems (400425) for Adap-tive Molecular Sensing in the Olfactory and Immune Sys-tems, and by the NSF Center for the Physics of Biolog-ical Function (PHY-1734030). AM was supported by aLewis–Sigler fellowship. VB thanks the Aspen Center forPhysics, which is supported by NSF grant PHY-160761,for hospitality during this work. [1] R. Barrangou, C. Fremaux, H. Deveau, M. Richards,P. Boyaval, S. Moineau, D. A. Romero, and P. Horvath,Science , 1709 (2007).[2] K. S. Makarova, Y. I. Wolf, J. Iranzo, S. A. Shmakov,O. S. Alkhnbashi, S. J. J. Brouns, E. Charpentier,D. Cheng, D. H. Haft, P. Horvath, S. Moineau, F. J. M.Mojica, D. Scott, S. A. Shah, V. Siksnys, M. P. Terns,C. Venclovas, M. F. White, A. F. Yakunin, W. Yan,F. Zhang, R. A. Garrett, R. Backofen, J. van der Oost,R. Barrangou, and E. V. Koonin, Nat. Rev. Microbiol. , 67 (2020).[3] J. Wang, J. Li, H. Zhao, G. Sheng, M. Wang, M. Yin,and Y. Wang, Cell , 840 (2015).[4] J. K. Nunez, L. B. Harrington, P. J. Kranzusch, A. N.Engelman, and J. A. Doudna, Nature , 535 (2015).[5] C. Pourcel, M. Touchon, N. Villeriot, J. P. Vernadet,D. Couvin, C. Toffano-Nioche, and G. Vergnaud, NucleicAcids Res. , D535 (2020).[6] S. van Houte, A. K. E. Ekroth, J. M. Broniewski,H. Chabas, B. Ashby, J. Bondy-Denomy, S. Gandon,M. Boots, S. Paterson, A. Buckling, and E. R. Westra,Nature , 385 (2016).[7] P. F. Vale, G. Lafforgue, F. Gatchitch, R. Gardan,S. Moineau, and S. Gandon, Proceedings of the RoyalSociety B: Biological Sciences , 20151270 (2015).[8] A. Martynov, K. Severinov, and I. Ispolatov, PLoSComp. Bio. , 1 (2017).[9] S. Bradde, A. Nourmohammad, S. Goyal, and V. Bal-asubramanian, Proc. Natl. Acad. Sci. U.S.A. , 5144(2020).[10] L. A. Marraffini and E. J. Sontheimer, Science , 1843(2008).[11] W. Jiang, I. Maniv, F. Arain, Y. Wang, B. R. Levin, andL. A. Marraffini, PLoS Genetics , e1003844 (2013).[12] A. Stern, L. Keren, O. Wurtzel, G. Amitai, and R. Sorek,Trends Genet. , 335 (2010).[13] R. B. Vercoe, J. T. Chang, R. L. Dy, C. Taylor, T. Grist-wood, J. S. Clulow, C. Richter, R. Przybilski, A. R. Pit-man, and P. C. Fineran, PLoS Genet. , e1003454 (2013).[14] D. Paez-Espino, W. Morovic, C. L. Sun, B. C. Thomas,K. Ueda, B. Stahl, R. Barrangou, and J. F. Banfield, Nat.Commun. , 1430 (2013).[15] Y. Wei, R. M. Terns, and M. P. Terns, Genes Dev. ,356 (2015).[16] L. A. Marraffini, Nature , 55 (2015).[17] R. Edgar and U. Qimron, J. Bacteriol. , 6291 (2010).[18] G. W. Goldberg, E. A. McMillan, A. Varble, J. W. Mod-ell, P. Samai, W. Jiang, and L. A. Marraffini, Nat. Com-mun. , 61 (2018).[19] C. Rollie, A. Chevallereau, B. N. J. Watson, T. Y. Chyou,O. Fradet, I. McLeod, P. C. Fineran, C. M. Brown,S. Gandon, and E. R. Westra, Nature , 149 (2020).[20] H. Deveau, R. Barrangou, J. E. Garneau, J. Labont´e,C. Fremaux, P. Boyaval, D. A. Romero, P. Horvath, andS. Moineau, J. Bacteriol. , 1390 (2008).[21] E. Semenova, M. M. Jore, K. A. Datsenko, A. Semenova,E. R. Westra, B. Wanner, J. van der Oost, S. J. Brouns,and K. Severinov, Proc. Natl. Acad. Sci. U.S.A. ,10098 (2011).[22] G. W. Goldberg, W. Jiang, D. Bikard, and L. A. Marraf-fini, Nature , 633 (2014). [23] A. Levy, M. G. Goren, I. Yosef, O. Auster, M. Manor,G. Amitai, R. Edgar, U. Qimron, and R. Sorek, Nature , 505 (2015).[24] J. W. Modell, W. Jiang, and L. A. Marraffini, Nature , 101 (2017).[25] Y. Fu, J. A. Foden, C. Khayter, M. L. Maeder, D. Reyon,J. K. Joung, and J. D. Sander, Nat. Biotechnol. , 822(2013).[26] P. D. Hsu, D. A. Scott, J. A. Weinstein, F. A. Ran,S. Konermann, V. Agarwala, Y. Li, E. J. Fine, X. Wu,O. Shalem, T. J. Cradick, L. A. Marraffini, G. Bao, andF. Zhang, Nat. Biotechnol. , 827 (2013).[27] J. K. Percus, O. E. Percus, and A. S. Perelson, Proc.Natl. Acad. Sci. U.S.A. , 1691 (1993).[28] S. Karlin and S. F. Altschul, Proc. Natl. Acad. Sci. U.S.A. , 2264 (1990).[29] A. Dembo, S. Karlin, and O. Zeitouni, The Annals ofProbability , 2022 (1994).[30] P. C. Fineran, M. J. Gerritzen, M. Su´arez-Diez,T. K¨unne, J. Boekhorst, S. A. van Hijum, R. H. Staals,and S. J. Brouns, Proc. Natl. Acad. Sci. U.S.A. ,E1629 (2014).[31] C. Jung, J. A. Hawkins, S. K. Jones, Y. Xiao, J. R. Ry-barski, K. E. Dillard, J. Hussmann, F. A. Saifuddin, C. A.Savran, A. D. Ellington, A. Ke, W. H. Press, and I. J.Finkelstein, Cell , 35 (2017).[32] K. A. Datsenko, K. Pougach, A. Tikhonov, B. L. Wanner,K. Severinov, and E. Semenova, Nat. Commun. , 945(2012).[33] R. H. Staals, S. A. Jackson, A. Biswas, S. J. Brouns,C. M. Brown, and P. C. Fineran, Nat. Commun. , 12853(2016).[34] T. J. Nicholson, S. A. Jackson, B. I. Croft, R. H. Staals,P. C. Fineran, and C. M. Brown, RNA Biology , 566(2019).[35] Grissa, Ibtissem and Drevet, Christine and Couvin,David, CRISPRCasdb (2020), [Online; accessed 26-July-2020].[36] K. E. Watters, C. Fellmann, H. B. Bai, S. M. Ren, andJ. A. Doudna, Science , 236 (2018).[37] J. He and M. W. Deem, Physical Review Letters ,128102 (2010).[38] B. R. Levin, PLoS Genet. , e1001171 (2010).[39] L. M. Childs, N. L. Held, M. J. Young, R. J. Whitaker,and J. S. Weitz, Evolution , 2015 (2012).[40] B. R. Levin, S. Moineau, M. Bushman, and R. Barran-gou, PLoS Genet. , e1003312 (2013).[41] J. Iranzo, A. E. Lobkovsky, Y. I. Wolf, and E. V. Koonin,J. Bacteriol. , 3834 (2013).[42] A. D. Weinberger, C. L. Sun, M. M. Pluci´nski, V. J.Denef, B. C. Thomas, P. Horvath, R. Barrangou, M. S.Gilmore, W. M. Getz, and J. F. Banfield, PLoS Comp.Biol. , e1002475 (2012).[43] S. Bradde, M. Vucelja, T. Te¸sileanu, and V. Bal-asubramanian, PLoS Comp. Bio. , 1 (2017),arXiv:1510.06082.[44] P. Han and M. W. Deem, Journal of the Royal SocietyInterface (2017).[45] A. G. Patterson, M. S. Yevstigneyeva, and P. C. Fineran,Current Opinion in Microbiology , 1 (2017). [46] G. W. Tyson and J. F. Banfield, Environ. Microbiol. ,200 (2008).[47] P. Horvath, D. A. Romero, A. C. Coˆut´e-Monvoisin,M. Richards, H. Deveau, S. Moineau, P. Boyaval, C. Fre-maux, and R. Barrangou, J. Bacteriol. , 1401 (2008).[48] C. J. E. Metcalf, A. T. Tate, and A. L. Graham, NatureEcology & Evolution , 1 (2017).[49] Z. Sethna, Y. Elhanati, C. S. Dudgeon, C. G. Callan,A. J. Levine, T. Mora, and A. M. Walczak, Proc. Natl.Acad. Sci. U.S.A. , 201700241 (2017).[50] R. Castro, S. Navelsaker, A. Krasnov, L. Du Pasquier,and P. Boudinot, Developmental and comparative im-munology , 28 (2017).[51] R. Covacu, H. Philip, M. Jaronen, D. C. Douek, S. Efroni,F. J. Quintana, R. Covacu, H. Philip, M. Jaronen,J. Almeida, J. E. Kenison, and S. Darko, Cell Reports , 2733 (2016).[52] R. Antia, V. V. Ganusov, and R. Ahmed, Nature ReviewsImmunology , 101 (2005).[53] G. Lythe, R. E. Callard, R. L. Hoare, and C. Molina-Par´ıs, Journal of Theoretical Biology , 214 (2016).[54] J. Desponds, A. Mayer, T. Mora, and A. M. Walczak,arXiv preprint arXiv:1703.00226 (2017).[55] M. Gaimann, M. Nguyen, J. Desponds, and A. Mayer,eLife , e61639 (2020).[56] S. Bradde, A. Nourmohammad, S. Goyal, and V. Bal-asubramanian, Proc. Natl. Acad. Sci. U.S.A. , 5144(2020).[57] R. A. Edwards and F. Rohwer, Nature Reviews Microbi-ology , 504 (2005).[58] A. Mayer, V. Balasubramanian, T. Mora, and A. M. Wal-czak, Proc. Natl. Acad. Sci. U.S.A. , 5950 (2015).[59] A. Mayer, T. Mora, O. Rivoire, and A. M. Walczak, Proc.Natl. Acad. Sci. U.S.A. , 8630 (2016).[60] A. Mayer, V. Balasubramanian, A. M. Walczak, andT. Mora, Proc. Natl. Acad. Sci. U.S.A. , 8815 (2019).[61] D. Couvin, A. Bernheim, C. Toffano-Nioche, M. Tou-chon, J. Michalik, B. Neron, E. P. C. Rocha,G. Vergnaud, D. Gautheret, and C. Pourcel, NucleicAcids Res. , W246 (2018). SUPPLEMENTARY INFORMATIONSI text on population dynamics
Consider a host population acquiring spacers of length l . Let the number of individuals in the population thathave repertoire size n ( n ≥
0) be X n . Consider spaceracquisition to occur at a rate b : X n b −→ X n +1 . (10)Spacer acquisition is balanced by spacer loss leading toa well-defined steady-state distribution of repertoire size.Spacer loss occurs through double-stranded DNA breaksfollowed by homologous recombination at a subsequentrepeat, which deletes chunks of the CRISPR array (seee.g. [46, 47]). The precise rate and mechanism by whichthis occurs is not well-understood. Here, we consider 3solvable scenarios of this process:1: X n d −→ X n − (11)2: X n dn −→ X n − (12)3: X n d −→ X . (13)The first scenario represents spacer loss at the end(s) ofthe CRISPR array, hence independent of n . The secondrepresents a constant per-spacer loss rate. For the thirdscenario, all spacers are lost in a deletion event, which isa solvable limit of several spacers being deleted at a time. Scenario 1: X n d −→ X n − . The probabilities P n ( n ≥
0) obey the following master equation:d P n d t = − ( b + d ) P n + bP n − + dP n +1 , n ≥ P d t = − bP + dP . (15)The steady state fulfills the detailed balance condition, dP n = bP n − . (16)We can solve the recursion equation (Eq. 16) for thesteady-state distribution, P n = ( b/d ) n (1 − b/d ) , (17)which is geometric with parameter 1 − b/d . Its mean is b/ ( b − d ), implying that a well-defined steady state is onlypossible if d > b . Scenario 2: X n dn −→ X n − . The master equation is:d P n d t = − ( b + dn ) P n + bP n − + d ( n + 1) P n +1 , n ≥ P d t = − bP + dP . (19)At steady state again detailed balance holds dnP n = bP n − . (20) Eq. 20 implies that the steady-state distribution is Pois-son with mean b/d : P n = 1 n ! ( b/d ) n e − b/d . (21) Scenario 3: X n d −→ X . Here, the master equation is:d P n d t = − ( b + d ) P n + bP n − , n ≥ P d t = − bP + d (1 − P ) . (23)Here there is no detailed balance, but probability flux isconserved, ( d + b ) P n = bP n − . (24)Eq. 24 implies that the steady-state distribution is geo-metric with parameter d/ ( b + d ): P n = (cid:20) bb + d (cid:21) n db + d . (25)The mean of this distribution is b/d .2 SI figures m e a n s p a c e r l e n g t h ln = (1.1 ± 0.1) + const. N l s ln = ln 4 + const. N l s scaling law, = /6, = 3.41 ± 0.02 k l k fix s var scaling law, = = 0 to = /6, = 8 k k k l k fix var fix s var data (with cas) 2,449 species(1 strain/species) Figure S1. Cross-reactivity parameters obtained by a fit to the empirical data lie in a plausible range. The blue points aredata from 2,449 species binned in increasing windows of repertoire size (50 species/bin), and the orange line is the linear fitto all species as in Fig. 2c. The green line is the naive ln 4 scaling (Eq. 5). The fitted slope is consistent with a broad rangeof cross-reactivity parameters (yellow region). A best-fit to Eq. 4 was performed, in which l p was fixed at 3, and k fix was setto l s /
6, consistently with a 6-nt periodicity in mismatch tolerance in type I-E systems [30, 31] and approximately 5 allowedmismatches in type II systems in which most spacer lengths are ∼
30 nt [25, 26]. We found best-fit values of k var = 3 . ± . ( L/P ) = 11 . ± .
02, where the errors are 90% confidence intervals. The estimate of k var is consistent with primedacquisition tolerating many mismatches, up to 10 in some systems [30, 33], and the estimate of L/P implies a maximum risk ofself-targeting P in the range of 10 − to 10 − . We expect these cross-reactivity parameters to show significant variation aroundthese means in individual species and systems (see Fig. 4). l ≤
30 31 ≤ ≤ l ≤ ≤ l ≤ ≤ l small medium large0.00.4 ≤ ≤ l small medium large l ≥ Usage of spacers of different lengthsrepertoire size bin
Figure S2. Variable usage of spacer lengths among multiple-type strains. 826 species with multiple CRISPR-Cas systemswere divided into 3 groups of small, medium and large repertoire sizes containing 275, 275 and 276 species, respectively. Eachrepertoire size bin was normalized to 1, so that the bars indicate the fraction of spacers in each repertoire size bin with thatlength. The usage of spacers of length ≤
32 nt decreases with repertoire size, while usage of spacers of length ≥
35 nt increaseswith repertoire size among these strains.
10 100spacer repertoire size123456789 nu m b e r o f l o c i CRISPR locicas loci
Figure S3. Spacer repertoire size is correlated with the number of CRISPR and cas loci. Data from 2,449 representative strainsbelonging to different species are binned by repertoire size (50 strains/bin). Error bars denote the standard error of the meanin each bin. m e a n s p a c e r l e n g t h One gene copy of both cas1 and cas2ln = (1.0 ± 0.1) + const.