[PDF] Heterologous autoimmunity and prokaryotic immune defense

Abstract

Some prokaryotes possess CRISPR-Cas systems that provide adaptive immunity to viruses guided by DNA segments called spacers acquired from invading phage. However, the patchy incidence and limited memory breadth of CRISPR-Cas systems suggest that their fitness benefits are offset by costs. Here, we propose that cross-reactive CRISPR targeting can lead to heterologous autoimmunity, whereby foreign spacers guide self-targeting in a spacer-length dependent fashion. Balancing antiviral defense against autoimmunity predicts a scaling relation between spacer length and CRISPR repertoire size. We find evidence for this scaling through comparative analysis of sequenced prokaryotic genomes, and show that this association also holds at the level of CRISPR types. In contrast, the scaling is absent in strains with nonfunctional CRISPR loci. Finally, we demonstrate that stochastic spacer loss can explain variations around the scaling relation, even between strains of the same species. Our results suggest that heterologous autoimmunity is a selective factor shaping the evolution of CRISPR-Cas systems.

Full PDF

AA scaling law in CRISPR repertoire sizes arises from avoidance of autoimmunity

Hanrong Chen, ∗ Andreas Mayer, † and Vijay Balasubramanian

1, 3 David Rittenhouse Laboratory, Department of Physics and Astronomy, University of Pennsylvania, USA Lewis-Sigler Institute for Integrative Genomics, Princeton University, USA Theoretische Natuurkunde, Vrije Universiteit Brussel, Belgium (Dated: January 6, 2021)Some bacteria and archaea possess an adaptive immune system that maintains a memory of pastviral infections as DNA elements called spacers, stored in the CRISPR loci of their genomes. Thismemory is used to mount targeted responses against threats. However, cross-reactivity of CRISPRtargeting mechanisms suggests that incorporation of foreign spacers can also lead to autoimmunity.We show that balancing antiviral defense against autoimmunity predicts a scaling law relating spacerlength and CRISPR repertoire size. By analyzing a database of microbial CRISPR-Cas systems,we ﬁnd that the predicted scaling law is realized empirically across prokaryotes, and arises throughthe proportionate use of diﬀerent CRISPR types by species diﬀering in the size of immune memory.In contrast, strains with nonfunctional CRISPR loci do not show this scaling. We also demonstratethat simple population-level selection mechanisms can generate the scaling, along with observedvariations between strains of a given species.

Clustered regularly interspaced short palindromic re-peats (CRISPR) and CRISPR-associated (Cas) proteinsform a prokaryotic defense against phage [1]. CRISPRloci are composed of DNA repeats alternating with vari-able DNA segments called spacers, acquired from phageand other foreign genetic material. In a process called in-terference, spacer RNA guides sequence-speciﬁc bindingand cleavage of target DNA by Cas proteins. In this way,spacers acquired during phage attack confer acquired,heritable resistance against subsequent invasions.CRISPR-Cas systems are remarkably diverse, char-acterized by functionally divergent Cas proteins anddistinct mechanisms for each stage of immune defense[2]. Spacer acquisition is mediated by the conservedCas1–Cas2 adaptation module, which sets spacer lengthswithin a narrow range varying by system [3, 4]. CRISPRarrays are also broadly distributed in size, ranging fromless than 10 to hundreds of spacers, and the full reper-toire of a host may comprise several CRISPR arrays [5].Maintaining a broad spacer repertoire confers resistanceagainst many phages and possible escape mutants [6].However, there are constitutive costs associated with Casprotein expression [7], and diminishing returns of broaddefense due to ﬁnite Cas protein copy numbers [8, 9]. Inaddition, CRISPR-Cas systems can prevent horizontaltransfer of beneﬁcial mobile genetic elements [10, 11].CRISPR-Cas systems also cause autoimmunity, occur-ring when a spacer guides interference somewhere on thehost genome, leading to cell death and strong mutationalpressure in the CRISPR-cas locus and target region [12–15]. The patchy incidence of CRISPR-Cas systems inprokaryotes (roughly 40% of bacteria and 85% of archaea[2]), and the presence of diverse mechanisms for self– ∗ These two authors contributed equally to this work.; Presentaddress: Computational & Systems Biology, Genome Instituteof Singapore, Singapore † These two authors contributed equally to this work. nonself discrimination [16], suggest that avoiding autoim-munity is a constraint in the evolution of CRISPR-Cassystems [2, 13–19].Several mechanisms exist in divergent CRISPR-Castypes for suppressing autoimmunity arising from diﬀer-ent forms of potential self-targeting [16]. In type I and IIsystems, interference requires presence of a protospacer-adjacent motif (PAM), a 2–5-nt-long sequence adjacentto target DNA but absent in CRISPR repeats, preventinginterference within the CRISPR array [20, 21]. In typeIII systems, interference requires transcription of targetDNA, which avoids targeting phages integrated into thehost chromosome (prophage) [22]. Spacers acquired fromthe host genome are naturally self-targeting, but thereare mechanisms to suppress such acquisition [23, 24]. Forexample, type I-E systems acquire spacers preferentiallyat double-stranded DNA breaks, which occur primarilyat stalled replication forks of replicating phage DNA, andacquisition is conﬁned by Chi sites which are enriched inbacterial genomes [23].Here we propose that CRISPR evolution is also shapedby heterologous autoimmunity , which occurs if an ac-quired foreign spacer and a segment of the host genomeare suﬃciently similar. The likelihood of this eﬀectdepends on sequence statistics and the speciﬁcity ofCRISPR targeting mechanisms. Heterologous autoim-munity is analogous to oﬀ-target eﬀects that are an im-portant concern in CRISPR-Cas genome editing [25, 26],but the possible eﬀects on prokaryotic adaptive immunityhave not been explored. We combine a probabilistic mod-eling approach with comparative analyses of CRISPRrepertoires across prokaryotes to show that: (a) het-erologous autoimmunity is a signiﬁcant threat caused byCRISPR-Cas immune defense, (b) avoidance of autoim-munity leads to a scaling law in CRISPR repertoires,and (c) the scaling law can be achieved by population-level selection. Our work suggests that avoidance of het-erologous autoimmunity is a key factor shaping CRISPRrepertoires and the evolution of CRISPR-Cas systems. a r X i v : . [ q - b i o . P E ] J a n I. RESULTSA. Cross-reactivity leads to autoimmunity

We approach heterologous self-targeting as a sequence-matching problem [27–29], and derive estimates for theprobability of a spacer being suﬃciently similar to at leastone site in the host genome. For a spacer of length l s and PAM of length l p (where it exists), an exact matchat a given position requires l ≡ l s + l p complementarynucleotides. In a host genome of length L , where L (cid:29) l ,there are L − l + 1 ≈ L starting positions for a match.At leading order, and ignoring nucleotide usage biases,we may treat matches as occurring independently withprobability 4 − l . Thus, the probability of an exact matchanywhere on the genome is (see Methods) p ≡ L − l , where l = l s + l p . (1)Considering order-of-magnitude parameter estimates forthe E. coli type I-E system of L = 5 × nt, l s = 32 nt,and l p = 3 nt gives a negligible probability p ∼ − .However, CRISPR interference tolerates several mis-matches between spacer RNA and target DNA depend-ing on position and identity [25, 26, 30, 31]. In general,mismatches in the PAM are not allowed, and mismatchesin the PAM-distal region are tolerated to a greater extentthan mismatches in the PAM-proximal region [21]. Up to ∼ k ﬁx ﬁxed positions, and (b) mismatches at k var variablepositions anywhere else in the target region. These in-crease the per-spacer self-targeting probability by a com-binatorial factor α ( k ﬁx , k var , l ) (see Methods), so that p self = α ( k ﬁx , k var , l ) p = α ( k ﬁx , k var , l ) L − l . (2)A greater number of allowed mismatches greatly in-creases the likelihood of heterologous self-targeting (Fig.1b). To gain intuition we can rewrite Eq. 2 as p self ≡ L − l eff , where (3) l eﬀ ( l, k ﬁx , k var ) ≡ l − log α ≈ l − k ﬁx − k var log l − k ﬁx ) , where l eﬀ is the eﬀective spacer length after discount-ing for allowed mismatches (see Methods). This shows that mismatches exponentially increase the probabilityof self-targeting, and variable-position mismatches par-ticularly so. Considering the E. coli system as before,the matching probability increases to p self ∼ − with k ﬁx = 5 nt and k var = 5 nt (see Fig. 1b). Other CRISPR-Cas systems may similarly lie in parameter regimes withappreciable p self , especially when including indirect self-targeting through primed acquisition [34]. Furthermore,the probability of self-targeting is likely higher than im-plied by our calculations as it can be increased by cor-relations in sequence statistics between host and phagegenomes [28, 29]. Given our estimates, we thus hypothe-size that heterologous autoimmunity may occur generallyand be a signiﬁcant cost of CRISPR-Cas immunity. B. Spacer length scales with repertoire size

To test this hypothesis, we exploited the large naturalvariability in CRISPR systems across diﬀerent microbialspecies. As the self-targeting probability depends expo-nentially on spacer length (Eqs. 2–3), we expect smalldiﬀerences in length to lead to large variations in the riskof autoimmunity. If CRISPR repertoire sizes are selectedto balance broader immunity against the risk of autoim-munity, then qualitatively we expect that species withshorter spacers should have smaller repertoires, whilespecies with longer spacers should have larger ones (Fig.1c).To make this prediction more quantitative, supposeprokaryotes tolerate a maximum probability P of self-targeting, and that CRISPR-Cas systems are selected tomaximize protection against pathogens subject to thisconstraint. Repertoires with N spacers incur a self-targeting probability of ∼ N p self , and thus Eq. 2 impliesln N = l ln 4 − ln α ( k ﬁx , k var , l ) − ln( L/P ) . (4)Linearizing the dependence of the combinatorial factor α around typical spacer lengths l (see Methods) predictsa scaling relationship between spacer length and the log-arithm of repertoire sizeln N ∼ l (cid:18) ln 4 − k var l − k ﬁx (cid:19) ∼ . l, (5)where we arrived at the latter estimate by taking k var ∼ k ﬁx ∼ l ∼ Figure 1.

CRISPR-Cas immune defense incurs a risk of heterologous autoimmunity. a , Sketch of the maincomponents of CRISPR-Cas immune defense. b , Per-spacer probability of heterologous self-targeting, p self , as a function of thenumber of tolerated mismatches at ﬁxed and variable positions along the spacer, k ﬁx and k var , respectively. c , We hypothesizethat the evolution of CRISPR-Cas systems is constrained by the risk of heterologous autoimmunity. As the self-targetingprobability depends strongly on spacer length, this predicts a scaling of repertoire size with spacer length. spacer repertoire sizes, deﬁned as the sum of CRISPRarray sizes in each genome, was broad, ranging from 1 to812 spacers (Fig. 2b).A linear regression between spacer length and log-repertoire size gave a slope of 1 . ± . k var = 3 . ± . k ﬁx = l s / Figure 2.

A scaling law relates CRISPR repertoire size and spacer length. a, b

Distributions of spacer lengths (a)and repertoire sizes (b) across prokaryotes. For each of 2,449 species with CRISPR and cas loci we randomly picked a singlestrain (see Methods), and calculated its repertoire size as the sum of all CRISPR array sizes present in the genome. The lengthdistribution of all spacers found in these ﬁltered strains are shown in (a). Bins in (b) were formed by dividing each decade into10 equal bins on a log scale. c , Scaling of repertoire size with spacer length. A linear ﬁt of the mean spacer length againstlog repertoire size was performed on all 2,449 species, and is shown alongside the data, which is binned by repertoire size(50 strains/bin). The ﬁtted slope is consistent with theory predictions (Eq. 5). d , Fraction of species with missing cas genesdecreases with spacer length. e , Spacer length and repertoire size do not show a clear relation in strains with nonfunctionalCRISPR loci. A linear ﬁt was performed on 340 species with CRISPR arrays but no cas loci, and is shown alongside the data,which is binned by repertoire size (50 strains/bin). Error bars in panels c–e denote the standard error of the mean in each bin,which in (d) are calculated assuming a binomial probability distribution for the absence of cas at each spacer length. C. Variable CRISPR-Cas type use underlies scaling

CRISPR-Cas systems are classiﬁed into diﬀerent typesand subtypes based on their evolutionary relationshipsand the use of diﬀerent cas genes [2]. We won-dered whether the aggregate scaling relationship betweenspacer length and repertoire size (Fig. 2) reﬂected dif-ferences at the level of CRISPR-Cas type usage. Wethus grouped the species by subtype, when there is a sin-gle CRISPR-Cas system in the genome, or in a separategroup when multiple subtypes are present.For species carrying a single cas type, we aggregatedall spacers found across species of each type to quantifythe statistics of acquired spacer lengths. We observed dif-ferences in the spacer length distributions between types(Fig. 3a): (a) Type II-A and II-C systems have narrowdistributions tightly clustered around 30 nt; (b) TypeI-E and I-F systems also have narrow distributions, clus-tered around 32 nt, while other type I systems have spac-ers that are longer and more broadly distributed; (c)Type III systems have even longer and more broadly dis- tributed spacers, with median lengths in the 36–39 ntrange.A broader distribution of acquired spacer lengths leadsto a higher risk of autoimmunity than a narrow distribu-tion with the same mean, since the self-targeting prob-ability increases exponentially for shorter spacers. Toaccount for an increase in autoimmune risk for broaderdistributions, we focused on the lower quartile of spacerlengths for each cas type as a proxy for autoimmune risk.Also, to account for the requirement of PAM recognitionin type I and II (but not type III) systems, we added aPAM length of 3 nt to types I and II to obtain the overalllength l . Strikingly, we observed that the predicted rela-tionship between l and repertoire size also broadly holdsbetween CRISPR-Cas types (Fig. 3b): Type II systemshave the shortest spacers and the smallest repertoires,and among type I subtypes those with shorter spacersgenerally have smaller repertoires. Type III systems havesmaller repertoires than type I systems despite somewhatlonger spacers, but this is explained by the absence ofPAMs and the broader spacer length distributions for Figure 3.

Preferential CRISPR-Cas type use underlies the scaling law relating spacer length and repertoire size.a , Length distributions of all spacers found in single-type strains aggregated by CRISPR-Cas type (for types with >

10 speciesin CRISPRCasdb [35]; see Methods). Also indicated are the median (solid vertical) and lower quartile (dotted vertical line) foreach distribution. The subtypes are presented in order of lower quartile. b , A trend is observed between spacer + PAM lengthand repertoire size for diﬀerent CRISPR-Cas types. For spacer lengths, the central dot is the lower quartile and the whiskerruns between the lowest decile and the median. Repertoire sizes are indicated as the mean ± standard error. To indicate therequirement of PAM recognition, a length of 3 nt was added to all type I and II (but not type III) subtypes. c , Variable usageof cas subtypes among multiple-type strains. A total of 826 strains with multiple CRISPR-Cas systems, randomly picked fromdiﬀerent species, were analyzed. They were divided into 3 groups of 275, 275 and 276 strains having small, medium, and largerepertoire size, respectively. Each subﬁgure was normalized to 1, so that the bars indicate the relative incidence of a subtypein each repertoire size bin. The order of subtypes is the same as panel a. the type III systems, both of which increase autoimmunerisk.We next tested whether this relation also carries overto species carrying multiple CRISPR-Cas systems, in theform of a diﬀerential use of cas types as a function ofrepertoire size. We divided species with multiple castypes into three equally sized groups by repertoire size,and determined the relative incidence of CRISPR sub-types within each group (Fig. 3c). We found that theuse of types II, I-E, and I-F decreases with repertoire sizein line with expectations, and an opposite pattern for twoof the type III systems and the type I systems with thelongest spacers. The relation between total repertoiresize and spacer length in species with multiple cas sub-types was further reinforced by a direct analysis of theincidence of spacers of diﬀerent lengths as a function ofrepertoire size, with a greater proportion of longer spac-ers present in larger repertoires (Fig. S2).Taken together, we ﬁnd that species carrying eithersingle or multiple CRISPR-Cas systems diﬀerentially useCRISPR-Cas types having diﬀerent spacer length distri-butions to form repertoires of diﬀerent sizes. This diﬀer-ential use gives rise to the aggregate scaling observed inFig. 2c, and is consistent with the hypothesis of minimiz-ing the risk of heterologous autoimmunity. D. Dynamical origin of the scaling law

Dynamical mechanisms can give rise to the scaling lawthat our theory predicts, and which is found in the em-pirical data. While spacer dynamics involves complexepidemiological feedbacks [8, 37–44], here we consider asimple eﬀective model in which spacer acquisition andloss are described as a birth-death process, such thatspacers are acquired at a rate b and lost at a per-spacerrate d (Fig. 4a, left panel). This gives rise to a Pois-son distribution of repertoire sizes at steady state, withmean b/d (see SI). Our statistical theory requires that themean of the distribution should shift with spacer length.There are two mechanisms by which selection could leadto such a dependence. First, the negative ﬁtness eﬀect ofacquiring self-targeting spacers [13] purges lineages thatundergo deleterious acquisition events. Indeed, CRISPRarrays are selected for the absence of self-targeting spac-ers [12]. Eﬀectively, this reduces the net acquisition rateamong surviving lineages. Second, over longer evolution-ary timescales, diﬀerent CRISPR-Cas systems may beselected to acquire spacers at diﬀerent rates dependingon their respective risks of autoimmunity. These diﬀer-ences in rates could arise from the maintenance of mul-tiple copies of cas genes, or through regulation of casexpression [45]. Indeed, spacer repertoire size increaseswith the number of cas loci (Fig. S3), suggesting thatlarger gene copy numbers of cas1 and cas2, necessary forspacer acquisition, result in greater acquisition rates. In-terestingly, strains having exactly one copy of both cas1 Figure 4.

A spacer acquisition-loss model with population-level selection of acquisition rates produces scalingas well as substantial variability between strains of the same species. a , In our model, strains acquire spacers at arate b and lose them with a per-spacer rate d , giving rise to a Poisson distribution at steady state with mean b/d (left panel). b is selected to minimize the risk of heterologous autoimmunity, such that species diﬀering in p self have diﬀerent mean repertoiresizes (middle panel). We generate a synthetic dataset of strains by sampling from steady-state distributions with diﬀerent spacerlengths and hence p self (see Methods). The synthetic data displays scaling of the mean and variability on the single-strain level(right panel). The green points show 100 individually sampled strains, the blue points means after binning by repertoire size(50 species/bin, 2,450 species total), and the orange line is a ﬁt to all 2,450 sampled points. b , Correlated spacer loss broadensthe predicted distribution of repertoire sizes. We consider a model in which all spacers are lost simultaneously during a deletionevent, which leads to a geometric steady-state distribution (see SI). Despite this additional variability, a synthetic samplegenerated as in panel a shows scaling of the means. c , The distributions of repertoire sizes of sequenced strains belonging to thesame species are broad. 4 pairs of species with >

50 sequenced strains and with the indicated CRISPR-Cas type are displayed.Vertical lines denote the mean for each species. and cas2 still obey a scaling relationship (Fig. S4), sug-gesting that regulation of these genes also contributes tominimizing autoimmune risk.Let us suppose that one or both of these selectionmechanisms lead to an eﬀective spacer acquisition rateinversely proportional to the risk of self-targeting, b ∝ /p self . To replicate the empirical analysis, we created asynthetic dataset of the same size by sampling strainsat random from steady-state distributions at diﬀerentspacer lengths, which have diﬀerent p self (Fig. 4a, mid-dle panel) (see Methods). Plotting spacer length againstmean repertoire size in the same way as we did for theempirical data, we recover a scaling law as predicted byour theory (Fig. 4a, right panel).In addition to providing a dynamical explanation forscaling of the means, this birth-death model producessubstantial variability around the mean relationship (Fig.4a, right panel, green dots). In fact, the Poisson vari- ance of Fig. 4a, originating from a constant per-spacerloss rate, is likely an underestimate. Spacer loss occursthrough double-stranded DNA breaks followed by ho-mologous recombination at a diﬀerent CRISPR repeat,a process which may delete chunks of an array in a singledeletion event (see e.g. [46, 47]). Including such cor-related spacer loss greatly increases the variance. Forexample, in a simple analytically solvable limiting casewhere entire arrays are lost at once, the distribution ofrepertoire sizes becomes geometric (see SI) and thus verybroad (Fig. 4b, left panel). Given this substantial vari-ability, we wondered whether a comparative analysis ofmany species could still recover a scaling in this model.We thus sampled strains from geometric steady-state dis-tributions whose means obey a scaling law as in panel a,observing a much larger variability in individual strains(Fig. 4b, right panel, green dots). However, the scalingof the mean is recovered with a ﬁt to the dataset andwhen strains are binned by repertoire size (Fig. 4b, rightpanel).Prompted by the broad variability predicted by cor-related spacer loss, we analyzed the repertoire size dis-tributions of species with many sequenced strains (seeMethods). We indeed observed a broad distribution evenamong strains of the same species (Fig. 4c). We com-pared the repertoire size distributions of four pairs ofhighly-sampled species with the same CRISPR-Cas type,and found that the within-species variability comprises asubstantial part of the overall variance. Additional vari-ability between species, leading to diﬀerent mean reper-toire sizes for species with the same CRISPR-Cas type,might originate from diﬀerent microbes inhabiting envi-ronments that diﬀer in viral diversity and thus pressureto acquire broad immune defense. We tested the robust-ness of the comparative analysis to this additional sourceof variability, by sampling strains from steady-state dis-tributions where we additionally sample the prefactor in b ∝ /p self from a wide distribution (see Methods). Thisfurther increases the variability of individually sampledstrains, but the means still show scaling (Fig. S5). II. DISCUSSION

An adaptive immune system is dangerous equipmentto have in an organism. There is always the risk that theimmune receptors, intended as defenses against foreigninvaders, will instead target the self. In CRISPR-Cassystems, biophysical mechanisms avoiding various formsof autoimmunity such as targeting of the CRISPR locusand self-spacer acquisition are known [16, 20, 21, 23], buthere we propose that heterologous autoimmunity, wherespacers acquired from foreign DNA seed self-targeting, isa signiﬁcant threat to microbes carrying CRISPR-Cas.This threat is analogous to oﬀ-target eﬀects in genome-editing applications [25, 26], and has been observed in anexperimental CRISPR-Cas system [33], but its wider im-plications for the evolution of CRISPR-Cas systems areunexplored. We showed that avoidance of this form of au-toimmunity while maximizing antiviral defense predictsa scaling law relating spacer length and CRISPR reper-toire size. The scaling depends on the number and natureof sequence mismatches permitted during CRISPR inter-ference and primed acquisition.To test our prediction we used a comparative approachanalyzing the natural variation in CRISPR-Cas systemsacross microbial species, and demonstrated that: (a) thepredicted scaling law is realized, (b) the observed scalingconstrains parameters for cross-reactive CRISPR target-ing to lie in a range consistent with experimental stud-ies, (c) the scaling arises in part from diﬀerential usageof diﬀerent CRISPR-cas subtypes having diﬀerent spacerlength distributions, and (d) the scaling, and hence a bal-anced tradeoﬀ between successful defense and autoimmu-nity, can be achieved by population-level selection mech-anisms. In addition, we demonstrated a negative control: CRISPR arrays in species that no longer have functionalCas proteins, and thus are not at risk of autoimmunity,do not show the predicted scaling relation. We proposetwo further tests of the link between spacer length andautoimmune risk: (1) If cross-reactivity leads to self-targeting, in addition to a depletion of self-targeting spac-ers in CRISPR arrays [12, 36], we predict a depletion ofspacers several mismatches away from self-targets, and(2) Our theory predicts that CRISPR-Cas subtypes withlonger spacers should acquire spacers more readily.A similar tradeoﬀ between sensitivity to pathogens andautoimmune risk shapes the evolution of vertebrate adap-tive immune systems [27, 48]. In the light of our results itwould be interesting to determine whether this tradeoﬀalso leads to a relation between the size of the immunerepertoire and speciﬁcity in vertebrates. Such a relationwill likely be harder to ascertain for vertebrates as pat-terns of cross-reactivity between lymphocyte receptorsand antigens are more complex. Interestingly, however,T cell receptor hypervariable regions in human are sev-eral nucleotides longer on average than those found inmice [49], which accompanies a substantial increase inrepertoire size in human. If longer hypervariable regionstranslate to a greater speciﬁcity on average, one mightview the increased human receptor length as an adapta-tion to a larger repertoire. The key to our current workwas the ability to compare microbial immune strategiesacross a large panel of phylogenetically distant species.Further insight into how this tradeoﬀ shapes vertebrateimmune systems might thus be gained by building onrecent eﬀorts to survey adaptive immune diversity in abroader range of vertebrates [50, 51].Many theoretical studies of adaptive immunity in bothprokaryotes [8, 37–44] and vertebrates [52–55] considerdetailed dynamical models of evolving immune reper-toires. For prokaryotes, such dynamical models can beregarded as describing the role of CRISPR-Cas as a short-term memory for defense against a co-evolving phage[56]. Studying adaptive immunity in this way requiresdetailed knowledge of the parameters controlling the dy-namics, many of which are not well-characterized experi-mentally. In this paper, we took an alternative approachof focusing on the statistical logic of adaptive immunity,where we regard the bacterial immune system as a func-tional mechanism for maintaining a long-term memory[56] of a diverse phage landscape [57], via probabilisticmatching of genomic sequences. Previous work takingthis perspective oﬀered an explanation for why prokary-otic spacer repertoires lie in the range of a few dozen toa few hundred spacers [56]. As in our discussion of possi-ble mechanisms for generating the observed scaling law,evolution should select dynamics that achieve the statis-tical organization that we predict, because this is what isuseful for achieving a broad defense against phage whileavoiding autoimmunity. A probability theory perspec-tive of this kind has been applied to the logic of theadaptive immune repertoire of vertebrates [58–60], butto our knowledge we are presenting a novel approach tothe study of CRISPR-based autoimmunity.

III. MATERIALS AND METHODS a. Derivation of self-targeting probability.

We esti-mate the probability of an alignment between a spacer +PAM sequence of length l and a host genome of length L . We assume that both sequences are random and un-correlated, with nucleotide usage frequencies of 1/4. Ina length- L genome, where L (cid:29) l , there are L − l + 1 ≈ L starting positions for an alignment. The matching prob-ability at each position, p m , depends on the number andnature of mismatches tolerated. In regimes where p m issmall, the matching probabilities at the diﬀerent posi-tions may be treated independently. Thus, the probabil-ity of having at least one alignment within the length- L genome is p self = 1 − (1 − p m ) L − l +1 ≈ Lp m , since p m (cid:28) , l (cid:28) L. (6)If no mismatches are tolerated, p m = 4 − l as in Eq. 1.At each site where a mismatch is allowed, four alterna-tive nucleotide choices are possible. This gives a certainnumber α of unique complementary sequences matchingto a given spacer, which we compute as a function ofthe number and nature of mismatches. If up to k ﬁx mis-matches are tolerated at ﬁxed positions in the alignment, α = 4 k fix . If instead up to k var mismatches are tol-erated anywhere in the complementary region, naively α ∼ (cid:0) lk var (cid:1) k var , where the binomial coeﬃcient is thenumber of combinations of sites where mismatches areallowed. This is however an upper bound as matchingsequences are overcounted, and the precise expression is α = k var (cid:88) i =0 (cid:18) li (cid:19) i , (7)where each term in the sum is the number of unique com-plementary sequences having exactly i mismatches. Thelargest term dominates, giving α ≈ (cid:0) lk var (cid:1) k var . Thus,combining k ﬁx mismatches at ﬁxed positions and up to k var mismatches at any of the remaining l − k ﬁx positionsgives α ( k ﬁx , k var , l ) ≈ k fix (cid:18) l − k ﬁx k var (cid:19) k var . (8)We can introduce an eﬀective spacer length, l eﬀ , by p m ≡ − l eff . To leading order the binomial expression in Eq. 8is approximated by ( l − k ﬁx ) k var . This gives l eﬀ ≈ l − k ﬁx − k var log l − k ﬁx ) as in Eq. 3. The probability that a repertoire of N spacers avoidsself-targeting, 1 − P self , is one minus the probability thatat least one spacer self-targets. This gives P self = 1 − (1 − p self ) N ≈ N p self , since Lp m (cid:28) . (9)If CRISPR repertoires are selected to maximize reper-toire size subject to the constraint P self ≤ P , we obtainEq. 4. Taylor expanding ln N around l = l gives Eq. 5to lowest order in l . b. Comparative analyses. For our comparative anal-yses we use CRISPRCasdb [5], which is a database ofCRISPR and cas loci identiﬁed using CRISPRCasFinder[61] in public bacterial and archaeal whole-genome as-semblies [35]. CRISPR arrays are assigned evidence lev-els 1–4, 4 being the highest conﬁdence [61]. We restrictedour analysis to level 4 CRISPR arrays only. Strains con-taining both annotated CRISPR and cas loci were usedfor the analyses in Figs. 2a–c, 3, and 4c. Strains contain-ing annotated CRISPR but no cas loci were used for theanalyses in Figs. 2d–e. In order to eliminate oversam-pling of certain species, we picked one strain at randomfrom each species for further analysis (2,449 species withannotated CRISPR and cas loci, and 340 species withannotated CRISPR but no cas loci). To produce Fig. 3,the randomly chosen strains were grouped by annotatedcas subtype, or into a separate group if they contain mul-tiple subtypes. The 12 subtypes shown in Figs. 3a and chave >

10 species represented in CRISPRCasdb. To pro-duce Fig. 4c, 4 pairs of species, each with >

50 sequencedstrains of the same CRISPR-Cas type, were chosen foranalysis. c. Synthetic data generation and analysis.

A syn-thetic dataset producing a scaling law was generatedin the following way: (1) A spacer of length l s wasdrawn from the length distribution of Fig. 2a, and (2)a repertoire size distribution with mean A/p self was cre-ated, from which one strain was sampled and added tothe dataset. Parameter values of L = 5 · , l p = 3, k ﬁx = l s / k var = 3, and A = 10 − . were used. Thesteady-state distributions are Poisson in Fig. 4a, and ge-ometric with the same mean in Fig. 4b. In Fig. S5, A wassampled from a log-normal distribution with the samemean, and standard deviation chosen such that the coef-ﬁcient of variation is 1.2. Acknowledgements.

We thank Serena Bradde forhelpful comments on this paper. VB and HC were sup-ported in part by a Simons Foundation grant in Mathe-matical Modeling for Living Systems (400425) for Adap-tive Molecular Sensing in the Olfactory and Immune Sys-tems, and by the NSF Center for the Physics of Biolog-ical Function (PHY-1734030). AM was supported by aLewis–Sigler fellowship. VB thanks the Aspen Center forPhysics, which is supported by NSF grant PHY-160761,for hospitality during this work. [1] R. Barrangou, C. Fremaux, H. Deveau, M. Richards,P. Boyaval, S. Moineau, D. A. Romero, and P. Horvath,Science , 1709 (2007).[2] K. S. Makarova, Y. I. Wolf, J. Iranzo, S. A. Shmakov,O. S. Alkhnbashi, S. J. J. Brouns, E. Charpentier,D. Cheng, D. H. Haft, P. Horvath, S. Moineau, F. J. M.Mojica, D. Scott, S. A. Shah, V. Siksnys, M. P. Terns,C. Venclovas, M. F. White, A. F. Yakunin, W. Yan,F. Zhang, R. A. Garrett, R. Backofen, J. van der Oost,R. Barrangou, and E. V. Koonin, Nat. Rev. Microbiol. , 67 (2020).[3] J. Wang, J. Li, H. Zhao, G. Sheng, M. Wang, M. Yin,and Y. Wang, Cell , 840 (2015).[4] J. K. Nunez, L. B. Harrington, P. J. Kranzusch, A. N.Engelman, and J. A. Doudna, Nature , 535 (2015).[5] C. Pourcel, M. Touchon, N. Villeriot, J. P. Vernadet,D. Couvin, C. Toﬀano-Nioche, and G. Vergnaud, NucleicAcids Res. , D535 (2020).[6] S. van Houte, A. K. E. Ekroth, J. M. Broniewski,H. Chabas, B. Ashby, J. Bondy-Denomy, S. Gandon,M. Boots, S. Paterson, A. Buckling, and E. R. Westra,Nature , 385 (2016).[7] P. F. Vale, G. Laﬀorgue, F. Gatchitch, R. Gardan,S. Moineau, and S. Gandon, Proceedings of the RoyalSociety B: Biological Sciences , 20151270 (2015).[8] A. Martynov, K. Severinov, and I. Ispolatov, PLoSComp. Bio. , 1 (2017).[9] S. Bradde, A. Nourmohammad, S. Goyal, and V. Bal-asubramanian, Proc. Natl. Acad. Sci. U.S.A. , 5144(2020).[10] L. A. Marraﬃni and E. J. Sontheimer, Science , 1843(2008).[11] W. Jiang, I. Maniv, F. Arain, Y. Wang, B. R. Levin, andL. A. Marraﬃni, PLoS Genetics , e1003844 (2013).[12] A. Stern, L. Keren, O. Wurtzel, G. Amitai, and R. Sorek,Trends Genet. , 335 (2010).[13] R. B. Vercoe, J. T. Chang, R. L. Dy, C. Taylor, T. Grist-wood, J. S. Clulow, C. Richter, R. Przybilski, A. R. Pit-man, and P. C. Fineran, PLoS Genet. , e1003454 (2013).[14] D. Paez-Espino, W. Morovic, C. L. Sun, B. C. Thomas,K. Ueda, B. Stahl, R. Barrangou, and J. F. Banﬁeld, Nat.Commun. , 1430 (2013).[15] Y. Wei, R. M. Terns, and M. P. Terns, Genes Dev. ,356 (2015).[16] L. A. Marraﬃni, Nature , 55 (2015).[17] R. Edgar and U. Qimron, J. Bacteriol. , 6291 (2010).[18] G. W. Goldberg, E. A. McMillan, A. Varble, J. W. Mod-ell, P. Samai, W. Jiang, and L. A. Marraﬃni, Nat. Com-mun. , 61 (2018).[19] C. Rollie, A. Chevallereau, B. N. J. Watson, T. Y. Chyou,O. Fradet, I. McLeod, P. C. Fineran, C. M. Brown,S. Gandon, and E. R. Westra, Nature , 149 (2020).[20] H. Deveau, R. Barrangou, J. E. Garneau, J. Labont´e,C. Fremaux, P. Boyaval, D. A. Romero, P. Horvath, andS. Moineau, J. Bacteriol. , 1390 (2008).[21] E. Semenova, M. M. Jore, K. A. Datsenko, A. Semenova,E. R. Westra, B. Wanner, J. van der Oost, S. J. Brouns,and K. Severinov, Proc. Natl. Acad. Sci. U.S.A. ,10098 (2011).[22] G. W. Goldberg, W. Jiang, D. Bikard, and L. A. Marraf-ﬁni, Nature , 633 (2014). [23] A. Levy, M. G. Goren, I. Yosef, O. Auster, M. Manor,G. Amitai, R. Edgar, U. Qimron, and R. Sorek, Nature , 505 (2015).[24] J. W. Modell, W. Jiang, and L. A. Marraﬃni, Nature , 101 (2017).[25] Y. Fu, J. A. Foden, C. Khayter, M. L. Maeder, D. Reyon,J. K. Joung, and J. D. Sander, Nat. Biotechnol. , 822(2013).[26] P. D. Hsu, D. A. Scott, J. A. Weinstein, F. A. Ran,S. Konermann, V. Agarwala, Y. Li, E. J. Fine, X. Wu,O. Shalem, T. J. Cradick, L. A. Marraﬃni, G. Bao, andF. Zhang, Nat. Biotechnol. , 827 (2013).[27] J. K. Percus, O. E. Percus, and A. S. Perelson, Proc.Natl. Acad. Sci. U.S.A. , 1691 (1993).[28] S. Karlin and S. F. Altschul, Proc. Natl. Acad. Sci. U.S.A. , 2264 (1990).[29] A. Dembo, S. Karlin, and O. Zeitouni, The Annals ofProbability , 2022 (1994).[30] P. C. Fineran, M. J. Gerritzen, M. Su´arez-Diez,T. K¨unne, J. Boekhorst, S. A. van Hijum, R. H. Staals,and S. J. Brouns, Proc. Natl. Acad. Sci. U.S.A. ,E1629 (2014).[31] C. Jung, J. A. Hawkins, S. K. Jones, Y. Xiao, J. R. Ry-barski, K. E. Dillard, J. Hussmann, F. A. Saifuddin, C. A.Savran, A. D. Ellington, A. Ke, W. H. Press, and I. J.Finkelstein, Cell , 35 (2017).[32] K. A. Datsenko, K. Pougach, A. Tikhonov, B. L. Wanner,K. Severinov, and E. Semenova, Nat. Commun. , 945(2012).[33] R. H. Staals, S. A. Jackson, A. Biswas, S. J. Brouns,C. M. Brown, and P. C. Fineran, Nat. Commun. , 12853(2016).[34] T. J. Nicholson, S. A. Jackson, B. I. Croft, R. H. Staals,P. C. Fineran, and C. M. Brown, RNA Biology , 566(2019).[35] Grissa, Ibtissem and Drevet, Christine and Couvin,David, CRISPRCasdb (2020), [Online; accessed 26-July-2020].[36] K. E. Watters, C. Fellmann, H. B. Bai, S. M. Ren, andJ. A. Doudna, Science , 236 (2018).[37] J. He and M. W. Deem, Physical Review Letters ,128102 (2010).[38] B. R. Levin, PLoS Genet. , e1001171 (2010).[39] L. M. Childs, N. L. Held, M. J. Young, R. J. Whitaker,and J. S. Weitz, Evolution , 2015 (2012).[40] B. R. Levin, S. Moineau, M. Bushman, and R. Barran-gou, PLoS Genet. , e1003312 (2013).[41] J. Iranzo, A. E. Lobkovsky, Y. I. Wolf, and E. V. Koonin,J. Bacteriol. , 3834 (2013).[42] A. D. Weinberger, C. L. Sun, M. M. Pluci´nski, V. J.Denef, B. C. Thomas, P. Horvath, R. Barrangou, M. S.Gilmore, W. M. Getz, and J. F. Banﬁeld, PLoS Comp.Biol. , e1002475 (2012).[43] S. Bradde, M. Vucelja, T. Te¸sileanu, and V. Bal-asubramanian, PLoS Comp. Bio. , 1 (2017),arXiv:1510.06082.[44] P. Han and M. W. Deem, Journal of the Royal SocietyInterface (2017).[45] A. G. Patterson, M. S. Yevstigneyeva, and P. C. Fineran,Current Opinion in Microbiology , 1 (2017). [46] G. W. Tyson and J. F. Banﬁeld, Environ. Microbiol. ,200 (2008).[47] P. Horvath, D. A. Romero, A. C. Coˆut´e-Monvoisin,M. Richards, H. Deveau, S. Moineau, P. Boyaval, C. Fre-maux, and R. Barrangou, J. Bacteriol. , 1401 (2008).[48] C. J. E. Metcalf, A. T. Tate, and A. L. Graham, NatureEcology & Evolution , 1 (2017).[49] Z. Sethna, Y. Elhanati, C. S. Dudgeon, C. G. Callan,A. J. Levine, T. Mora, and A. M. Walczak, Proc. Natl.Acad. Sci. U.S.A. , 201700241 (2017).[50] R. Castro, S. Navelsaker, A. Krasnov, L. Du Pasquier,and P. Boudinot, Developmental and comparative im-munology , 28 (2017).[51] R. Covacu, H. Philip, M. Jaronen, D. C. Douek, S. Efroni,F. J. Quintana, R. Covacu, H. Philip, M. Jaronen,J. Almeida, J. E. Kenison, and S. Darko, Cell Reports , 2733 (2016).[52] R. Antia, V. V. Ganusov, and R. Ahmed, Nature ReviewsImmunology , 101 (2005).[53] G. Lythe, R. E. Callard, R. L. Hoare, and C. Molina-Par´ıs, Journal of Theoretical Biology , 214 (2016).[54] J. Desponds, A. Mayer, T. Mora, and A. M. Walczak,arXiv preprint arXiv:1703.00226 (2017).[55] M. Gaimann, M. Nguyen, J. Desponds, and A. Mayer,eLife , e61639 (2020).[56] S. Bradde, A. Nourmohammad, S. Goyal, and V. Bal-asubramanian, Proc. Natl. Acad. Sci. U.S.A. , 5144(2020).[57] R. A. Edwards and F. Rohwer, Nature Reviews Microbi-ology , 504 (2005).[58] A. Mayer, V. Balasubramanian, T. Mora, and A. M. Wal-czak, Proc. Natl. Acad. Sci. U.S.A. , 5950 (2015).[59] A. Mayer, T. Mora, O. Rivoire, and A. M. Walczak, Proc.Natl. Acad. Sci. U.S.A. , 8630 (2016).[60] A. Mayer, V. Balasubramanian, A. M. Walczak, andT. Mora, Proc. Natl. Acad. Sci. U.S.A. , 8815 (2019).[61] D. Couvin, A. Bernheim, C. Toﬀano-Nioche, M. Tou-chon, J. Michalik, B. Neron, E. P. C. Rocha,G. Vergnaud, D. Gautheret, and C. Pourcel, NucleicAcids Res. , W246 (2018). SUPPLEMENTARY INFORMATIONSI text on population dynamics

Consider a host population acquiring spacers of length l . Let the number of individuals in the population thathave repertoire size n ( n ≥

0) be X n . Consider spaceracquisition to occur at a rate b : X n b −→ X n +1 . (10)Spacer acquisition is balanced by spacer loss leading toa well-deﬁned steady-state distribution of repertoire size.Spacer loss occurs through double-stranded DNA breaksfollowed by homologous recombination at a subsequentrepeat, which deletes chunks of the CRISPR array (seee.g. [46, 47]). The precise rate and mechanism by whichthis occurs is not well-understood. Here, we consider 3solvable scenarios of this process:1: X n d −→ X n − (11)2: X n dn −→ X n − (12)3: X n d −→ X . (13)The ﬁrst scenario represents spacer loss at the end(s) ofthe CRISPR array, hence independent of n . The secondrepresents a constant per-spacer loss rate. For the thirdscenario, all spacers are lost in a deletion event, which isa solvable limit of several spacers being deleted at a time. Scenario 1: X n d −→ X n − . The probabilities P n ( n ≥

0) obey the following master equation:d P n d t = − ( b + d ) P n + bP n − + dP n +1 , n ≥ P d t = − bP + dP . (15)The steady state fulﬁlls the detailed balance condition, dP n = bP n − . (16)We can solve the recursion equation (Eq. 16) for thesteady-state distribution, P n = ( b/d ) n (1 − b/d ) , (17)which is geometric with parameter 1 − b/d . Its mean is b/ ( b − d ), implying that a well-deﬁned steady state is onlypossible if d > b . Scenario 2: X n dn −→ X n − . The master equation is:d P n d t = − ( b + dn ) P n + bP n − + d ( n + 1) P n +1 , n ≥ P d t = − bP + dP . (19)At steady state again detailed balance holds dnP n = bP n − . (20) Eq. 20 implies that the steady-state distribution is Pois-son with mean b/d : P n = 1 n ! ( b/d ) n e − b/d . (21) Scenario 3: X n d −→ X . Here, the master equation is:d P n d t = − ( b + d ) P n + bP n − , n ≥ P d t = − bP + d (1 − P ) . (23)Here there is no detailed balance, but probability ﬂux isconserved, ( d + b ) P n = bP n − . (24)Eq. 24 implies that the steady-state distribution is geo-metric with parameter d/ ( b + d ): P n = (cid:20) bb + d (cid:21) n db + d . (25)The mean of this distribution is b/d .2 SI ﬁgures m e a n s p a c e r l e n g t h ln = (1.1 ± 0.1) + const. N l s ln = ln 4 + const. N l s scaling law, = /6, = 3.41 ± 0.02 k l k fix s var scaling law, = = 0 to = /6, = 8 k k k l k fix var fix s var data (with cas) 2,449 species(1 strain/species) Figure S1. Cross-reactivity parameters obtained by a ﬁt to the empirical data lie in a plausible range. The blue points aredata from 2,449 species binned in increasing windows of repertoire size (50 species/bin), and the orange line is the linear ﬁtto all species as in Fig. 2c. The green line is the naive ln 4 scaling (Eq. 5). The ﬁtted slope is consistent with a broad rangeof cross-reactivity parameters (yellow region). A best-ﬁt to Eq. 4 was performed, in which l p was ﬁxed at 3, and k ﬁx was setto l s /

6, consistently with a 6-nt periodicity in mismatch tolerance in type I-E systems [30, 31] and approximately 5 allowedmismatches in type II systems in which most spacer lengths are ∼

30 nt [25, 26]. We found best-ﬁt values of k var = 3 . ± . ( L/P ) = 11 . ± .

02, where the errors are 90% conﬁdence intervals. The estimate of k var is consistent with primedacquisition tolerating many mismatches, up to 10 in some systems [30, 33], and the estimate of L/P implies a maximum risk ofself-targeting P in the range of 10 − to 10 − . We expect these cross-reactivity parameters to show signiﬁcant variation aroundthese means in individual species and systems (see Fig. 4). l ≤

30 31 ≤ ≤ l ≤ ≤ l ≤ ≤ l small medium large0.00.4 ≤ ≤ l small medium large l ≥ Usage of spacers of different lengthsrepertoire size bin

Figure S2. Variable usage of spacer lengths among multiple-type strains. 826 species with multiple CRISPR-Cas systemswere divided into 3 groups of small, medium and large repertoire sizes containing 275, 275 and 276 species, respectively. Eachrepertoire size bin was normalized to 1, so that the bars indicate the fraction of spacers in each repertoire size bin with thatlength. The usage of spacers of length ≤

32 nt decreases with repertoire size, while usage of spacers of length ≥

35 nt increaseswith repertoire size among these strains.

10 100spacer repertoire size123456789 nu m b e r o f l o c i CRISPR locicas loci

Figure S3. Spacer repertoire size is correlated with the number of CRISPR and cas loci. Data from 2,449 representative strainsbelonging to diﬀerent species are binned by repertoire size (50 strains/bin). Error bars denote the standard error of the meanin each bin. m e a n s p a c e r l e n g t h One gene copy of both cas1 and cas2ln = (1.0 ± 0.1) + const.