Immune Fingerprinting through Repertoire Similarity
Thomas Dupic, Meriem Bensouda Koraichi, Anastasia Minervina, Mikhail Pogorelyy, Thierry Mora, Aleksandra M. Walczak
IImmune fingerprinting
Thomas Dupic, Meriem Bensouda Koraichi, Anastasia Minervina, Mikhail Pogorelyy,
3, 4
Thierry Mora, ∗ and Aleksandra M. Walczak ∗ Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA Laboratoire de physique de l’ ´Ecole Normale Sup´erieure, CNRS, Sorbonne Universit´e, Universit´ede Paris, and ´Ecole normale sup´erieure (PSL), 24 rue Lhomond, 75005 Paris, France Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia Pirogov Russian National Research Medical University, Moscow, Russia
Immune repertoires provide a unique fingerprint reflecting the immune history of individuals,with potential applications in precision medicine. However, the question of how personal thatinformation is and how it can be used to identify people has not been explored. Here, we showthat individuals can be uniquely identified from repertoires of just a few thousands lymphocytes.We present “Immprint,” a classifier using an information-theoretic measure of repertoire similarityto distinguish pairs of repertoire samples coming from the same versus different individuals. Usingpublished data and statistical modeling, we tested its ability to identify individuals with greataccuracy, including identical twins, by computing false positive and false negative rates < − using 10,000 cells. The method is robust to acute infections and the passage of time. These resultsemphasize the private and personal nature of repertoire data. Personalized medicine is a frequent promise of next-generation sequencing. These high-throughput and low-cost sequencing technologies hold the potential of tai-lored treatment for each individual. However, progresscomes with privacy concerns. Genome sequences cannotbe anonymized: a genetic fingerprint is in itself enoughto fully identify an individual, with the rare exception ofmonozygotic twins. The privacy risks brought by thesepseudonymized genomes have been highlighted by multi-ple studies [1–3], and the approach is now routinely usedby law enforcement. Sequencing experiments that focuson a limited number of expressed genes should be lessprone to these concerns. However, as we will show, B-and T-cell receptor (BCR and TCR) genes are an excep-tion to this rule.BCR and TCR are randomly generated through so-matic recombination [4], and the fate of each B- or T-cellclone depends on the environment and immune history.The immune repertoire, defined as the set of BCR orTCR expressed in an individual, has been hailed a faith-ful, personalized medical record, and repertoire sequenc-ing (RepSeq) as a poential tool of choice in personalizedmedicine [5–9]. In this report we show that each person’srepertoire is truly unique. We describe how, from smallquantities of blood (blood spot or heel prick), one canextract enough information to uniquely identify an indi-vidual, providing an immune fingerprint, which we call“Immprint”.Given two samples of peripheral blood respectivelycontaining M and M T cells, we want to distinguishbetween two hypothetical scenarios: either the two sam-ples come from the same individual (“autologous” sce- ∗ These authors contributed equally. Correspondance should besent to [email protected] , [email protected] nario), or they were obtained from two different individ-uals (“heterologous” scenario), see Fig. 1a.TCR are formed by two protein chains α and β . Theyeach present a region of high somatic variability, labeledCDR3 α and CDR3 β , randomly generated during the re-combination process. These regions are coded by shortsequences (around 50 nucleotides), which are capturedby RepSeq experiments. The two chains are usually notsequenced together so that the pairing information be-tween α and β is lost. Most experiments focus on the β chain, and we will focus on that chain, but the results arelargely independent of this choice. CDR3 β sequences arevery diverse, with more than 10 possible sequences [10].For comparison, a human TCR repertoire is composed of10 to 10 unique clonotypes [11, 12]. As a result, mostof the sequences found in a repertoire are “private”.To discriminate between the autologous and heterolo-gous scenarios, one can count the number of nucleotidereceptor sequences, S , shared between the two samples.Samples coming from the same individual should havemore receptors in common because T-cells are organizedin clones of cells carrying the same TCR. By contrast, S should be low in pairs of samples from different individ-uals, in which sharing is due to rare convergent recom-binations. Appropriately setting a threshold to jointlyminimize the rates of false positives and false negatives(Fig. 1b), we can use S as a classifier to distinguish au-tologous from heterologous samples.The S score can be improved upon by exploiting thefact that some receptors are much more likely than oth-ers to be generated during V(D)J-recombination, withvariations in generation probability ( P gen , [13–15]) span-ning 15 orders of magnitude. Public sequences (with high P gen ) are likely to be found in multiple individuals [16],while rare sequences (low P gen ) are unlikely to be sharedby different individuals, and thus provide strong evidencefor the autologous scenario when found in both samples. a r X i v : . [ q - b i o . GN ] J un A B A BS = 2I = 19.7 S = 1I = 3.3Autologous Heterologous P ub li c P ub li c P r i v a t e a. › S fi › I fi c. − − − − False positive rate − − − − − F a l s e n e g a ti v e r a t e SIM = 3000M = 5000M = 10000 e. Summary statistic (I or S)Heterologous AutologousTrue neg.True pos.False pos.False neg. b. P r opo r ti on
200 400Number of shared sequences S 0.000.020.04HeterologousHeterologous, TwinsAutologous d. − − − − − − − − AU R O C S( )I( )S( )I( ) f. FIG. 1: a) The two samples A and B can either originate from the same individual (autologous) or two different individuals(heterologous). In both scenarios, sequences can be shared between the two samples, but their quantity and quality vary. b)Schematic representation of the distribution of the S or I scores in both scenarios. The dashed vertical line represents thethreshold value. c) Expected value of S and I for different pairs of samples, sampled from the same individual (in blue) ordifferent ones (orange). Red dots represent samples extracted from pairs of twins. The dashed lines represents the theoreticalupper bound (see Methods) for both S and I ( γ = 12). d) Distribution of S in both scenarios (orange heterologous, blueautologous) for different pairs of samples, M = 5000. The distributions in red correspond to a pair of samples extractedfrom twins. e) Detection Error Trade-off (DET) graph for both summary statistics and different sample sizes M . I ( γ = 12)outperforms S in all scenarios. f) AUROC (Area Under Receiver Operating Characteristic), as a function of M . The AUROCis a traditional measure of the quality of a binary classifier (a score closer to one indicates a better classifier). The results areshown for S and I both in the default case (only the β chain considered) or for the full ( α - β ) receptor. To account for this information, we define the score: I = (cid:88) shared s [ln (1 /P gen ( s )) − γ ] , (1)which accounts for Shannon’s “surprise” ln(1 /P gen )—ameasure of unexpectedness—associated with each sharedsequence s , so that rare shared sequences count morethan public ones. The constant γ depends on the reper-toire’s clonal structure and is set to 12 in the following(see Methods for an information-theoretic derivation). P gen is computed using models previously trained on datafrom multiple individuals [14]. Small differences reportedbetween the P gen of distinct individuals justify the use ofa universal model [15].We tested the classifiers based on the S and I scoreson TCR β RepSeq datasets from 656 individuals [17]. Se-quences were downsampled to mimic experiments where M = M = M cells were analyzed (including a pro-cedure to correct for the limited diversity of the sam-pled repertoire relative to the full repertoire, see Meth-ods I). Similar results may be obtained when M and M are different (see Methods). In Fig. 1c, we plot themean value of S (over many draws of M = 5000 recep-tors) for each individual (autologous scenario, in blue) and between pairs of different individuals (heterologousscenario, in orange). The two scenarios are clearly dis-cernable under both scores This result holds for pairs ofmonozygotic twins obtained from a distinct dataset [18](pink dots), consistent with previous reports that twinsdiffer almost as much in their repertoires as unrelated in-dividuals [18–20]. Heterologous scores (orange dots) varylittle, and may be bounded from above by a theoreticalprediction (dashed line) based on a model of recombina-tion [21] (see Methods). On the other hand, autologousscores (blue dots) show several orders of magnitude ofvariability across individuals. These variations stem fromthe clonal structure of the repertoire, and correlates withmeasures of diversity (Fig. S1), which is known to varya lot between individuals and correlates with age [22],serological status, and infectious disease history [23, 24].To explore the worst case scenario of discriminability,hereafter we will focus on the individual with the lowestautologous S found in the dataset.The sampling process introduces an additional sourceof variability within each individual. Two samples ofblood from the same individual do not contain the exactsame receptors, and the values of S and I is expected tovary between replicates. Example of distributions for S between different pairs of replicates in the same (blue)and in different individual are given in Fig. 1d. The dis-tribution of S is well-approximated by a Poisson distribu-tion, while I follows approximately a compound distribu-tion of a normal and Poisson distributions (see Methodsfor details). Armed with these statistical models of varia-tions, we can predict upper bounds for the false negativeand false positive rates. As seen from the detection errortrade-off (DET) graph Fig. 1e, the Immprint classifierperforms very well for a few thousand receptors with anadvantage for I .With 10 ,
000 cells, corresponding to ∼ µ L of blood,Immprint may simultaneously achieve a false positiverate of < − and false negative rate of < − , al-lowing for the near-certain identification of an individ-ual in pairwise comparisons against the world population ∼ . When a large reference repertoire has been col-lected ( M = 1 , , ∼ I score out-performs the S score (Fig. 1f), particularly above moder-ate sample sizes ( M ≈ αβ , whenthe pairing of the two chains is available (through single-cell sequencing [25–27] or computational pairing [28]),using P gen ( α, β ) = P gen ( α ) × P gen ( β ) [29] for the gener-ation probability of the full TCR. Because coincidentalsharing of both chains is substantially rarer than with the β chain alone, using the paired chain information greatlyimproves the classifier.The previous results used samples obtained at the sametime. However, immune repertoires are not static: inter-action with pathogens and natural aging modify theircomposition. The evolution of clonal frequencies willdecrease Immprint’s reliability with time, especially ifthe individual has experienced immune challenges in themeantime.To study the effect of short-term infections, we ana-lyzed an experiment where 6 individuals were vaccinatedwith the yellow fever vaccine, which is regarded as a goodmodel of acute infection, and their immune system wasmonitored regularly through blood draws [18]. We ob-serve an only moderate drop in S caused by vaccination(Fig. 2a). This is consistent with the fact that infectionslead to the strong expansion of only a limited number ofclones, while the rest of the immune system stays stable[30–33]. While other types of infections, auto-immunediseases, and cancers may affect Immprint in more sub-stantial ways, our result suggests that it is relatively ro-bust to changes in condition.We then asked how stable Immprint is over long times.Addressing this issue is hampered by the lack of longi- tudinal datasets over long periods, so we turn to math-ematical models [12, 34–37] to describe the dynamics ofthe repertoire. Following the model of fluctuating growthrate described in Ref. [36], we define two typical evolu-tionary timescales for the immune system: τ , the typicalturnover rate of T-cell clones, and θ , which representsthe typical time for a clonotype to grow or shrink bya factor two as its growth rate fluctuates. The modelpredicts a power-law distribution for the clone-size dis-tribution, with exponent − − τ / θ . This exponent hasbeen experimentally measured to be ≈ −
2, which leavesus with a single parameter τ , and θ = τ /
2. An exampleof simulated evolution of Immprint with time is shownin Fig. 2b. The highlighted histogram represents a datapoint at two years obtained from [38]. While a fit is pos-sible for this specific individual, the τ parameter is notuniversal, and we expect it to vary between individuals,especially as a function of age. In Fig. 2c we explore arange of reasonable values for the clone turn-over rate τ (from 6 months to 10 years), and their effect on thestability of Immprint. We observe that for most individ-uals, bar exceptional events, Immprint should conserveits accuracy for years or even decades.In summary, we demonstrated that the T-cells presentin small blood samples provide a somatic and long-lived barcode of human individuality, which is robustto immune challenges and stable over time. Unlikegenome sequencing, repertoire sequencing can discrim-inate monozygotic twins with the same accuracy as un-related individuals. However, a person’s unique immunefingerprint can be completely wiped out by a hematopoi-etic stem cell transplant [39]. Immprint is implemented ina python package and webapp (see Methods) allowing theuser to determine the autologous or heterologous originof a pair of repertoires. Beyond identifying individuals,the tool could be used to check for contamination or la-belling errors between samples containing TCR informa-tion. The repertoire information used by Immprint canbe garnered not only from RepSeq experiments, but alsofrom RNA-Seq experiments, which contain thousands ofimmune receptor transcripts [40, 41]. Relatively smallsamples of immune repertoires are enough to uniquelyidentify an individual even among twins, with potentialforensics applications. At the same time, unlike geneticdata from genomic or mRNA sequencing, Immprint pro-vides no information about kin relationships, very muchlike classical fingerprints, and avoids privacy concernsabout disclosing genetic information shared with nonconsenting relatives. Acknowledgments
The study was supported by the European ResearchCouncil COG 724208. A.A.M and M.V.P. are supportedby RSF-20-15-00351.
Days postvaccination N u m b e r o f s h a r e d s e qu e n ce s , S Donors
P1P2Q1Q2S1S2 a.
10 50 100 500Number of shared sequences, S P r opo r ti on year 0year 2year 4year 6year 8year 2 (data) b. S / S t = τ = 0 . yr τ = 1 yr τ = 4 yrs τ = 10 yrsData c. FIG. 2: a) Evolution of S during vaccination, between a sample taken at day 0 (vaccination date) and at a later timepoint.Each color represents a different individual. Each pair timepoint/individual has two biological replicates. The dashed linerepresents the threshold value. b) Evolution of S between a sample taken at year 0 and a later timepoint. Blue histogramsshow theoretical estimates, and the red histogram corresponds to a real dataset. c) Evolution of the (normalized) mean of S as a function of time for different values of the turnover rate τ . The dashed line represents the threshold value divided by thesmallest value of S t =0 in the data.[1] Homer N, et al. (2008) Resolving Individuals Contribut-ing Trace Amounts of DNA to Highly Complex MixturesUsing High-Density SNP Genotyping Microarrays. PLOSGenetics
ACM computing surveys
Proc Natl Acad Sci USA
Current Opinion inImmunology αβ T cell receptorsas predictors of health and disease.
Cellular & molecularimmunology
Genome medicine
Annual Review of Immunology αβ T cell and B cell receptor repertoires.
Current Opinion in Immunology arXiv:1604.00487 [q-bio] .[11] Qi Q, et al. (2014) Diversity and clonal selection in thehuman T-cell repertoire.
Proceedings of the NationalAcademy of Sciences of the United States of America
Jour-nal of Theoretical Biology
Proceedings ofthe National Academy of Sciences
Na-ture Communications bioRxiv p 2020.01.08.899682.[16] Venturi V, Price DA, Douek DC, Davenport MP (2008)The molecular basis for public T-cell responses?
NatureReviews Immunology
Nature Genetics
Proceedings of the Na- tional Academy of Sciences p 201809642.[19] Zvyagin IV, et al. (2014) Distinctive properties of identi-cal twins’ TCR repertoires revealed by high-throughputsequencing.
Proceedings of the National Academy of Sci-ences of the United States of America α / β -chain pairing in repertoire formation of iden-tical twins. Proceedings of the National Academy of Sci-ences
Immuno-logical Reviews
TheJournal of Immunology
Journal of Experimental Medicine
The Journal of Im-munology α and TCR β chains at the single-cell level in mice. Journal of ClinicalInvestigation
Genome Medicine αβ chain pairing shapes the T cell repertoire. bioRxiv:213462 .[28] Howie B, et al. (2015) High-throughput pairing of Tcell receptor α and β sequences. Science TranslationalMedicine Aβ T-cell receptor.
PLOS Computational Bi-ology
Journal of Virology
Cell Reports
Science Translational Medicine
Vaccine
Im-munological Reviews
The Journal of Immunology
Proceedings of the National Academy of Sciences eLife
BMC Immunology
Leukemia
Nature Genetics
Nature Biotechnology
Nature Methods
Nature Methods
Bioinformatics
PLoS Com-putational Biology
Science
Fron-tiers in Immunology
I. METHODSDatasets & Pre-processing
We use four independant RepSeq datasets in this study: (i) genomic DNA from Peripheral blood mononuclear cells(PBMCs) from 656 healthy donors [17]; (ii) cDNA of PBMCs sampled from three pairs of twins, before and after ayellow-fever vaccination [18]; (iii), (iv) two longitudinal studies of healthy adults [22, 38] .CDR3 nucleotide sequences were extracted with MIGEC [42] (for the second dataset) coupled with MiXCR [43].We also extract the frequency of reads from the three datasets. The non-productive sequences were discarded (out-of-frame, non-functional V gene, or presence of a stop codon). The generation probability ( P gen ) was computed usingOLGA [44], with the default TCR β model. The frequency of each clone was estimated through the number of reads,which we use as an imperfect proxy for the number of cells.The preprocessing code is distributed on the Git repository associated with the paper. We also developed acommand-line tool ( https://github.com/statbiophys/Immprint ) that discriminates between sample origins, anda companion webapp ( https://immprint.herokuapp.com ). Discrimination scores
To discriminate between the autologous and heterologous scenarios, we introduce a log-likelihood ratio test betweenthe two possibilities: I = (cid:88) s ln P ( y ( s ) , y ( s ) | autologous) P ( y ( s ) , y ( s ) | heterologous) , (2)where y ( s ) = 1 if the sequence s is found in sample 1, and 0 otherwise; likewise y ( s ) = 1 if s is in sample 2. Thesum runs over all potential sequences s , including unseen ones. To be present in a sample, a sequence s first has to bepresent in the repertoire. This occurs with probability 1 − (1 − p ( s )) N c , where N c is the total number of clonotypesin the repertoire, and p ( s ) is the probability of occurence of sequence s (resulting from generation and selection, seebelow). Second, it must be picked in a sample of size M , with probability 1 − (1 − f ) M ≈ M f (assuming
M f (cid:28) f , which is distributed according to the clone size distribution ρ ( f ). We checked that f ( s )and P gen ( s ) were not correlated (Fig. S3). Then one can write P ( y ( s ) = 1 , y ( s ) = 1 | autologous) ≈ (cid:16) − e − N c p ( s ) (cid:17) M M (cid:90) df ρ ( f ) f , (3) P ( y ( s ) = 1 , y ( s ) = 0 | autologous) ≈ (cid:16) − e − N c p ( s ) (cid:17) M N c and 1 ↔ , (4) P ( y ( s ) = 0 , y ( s ) = 0 | autologous) ≈ − (cid:16) − e − N c p ( s ) (cid:17) M + M N c , (5)where we’ve used (cid:82) df ρ ( f ) f = 1 /N c . For the heterologous case the probability factorizes as: P ( y ( s ) , y ( s ) | heterologous) = P ( y ( s )) P ( y ( s )) , (6)with P a ( y a ( s ) = 1) ≈ (cid:16) − e − N c p ( s ) (cid:17) M a N c , a = 1 , . (7)Since only the term y ( s ) = y ( s ) = 1 (shared sequences) is different between the autologous and heterologous cases,we obtain: I = (cid:88) shared s (cid:104) ln( N c (cid:104) f (cid:105) ) − ln (cid:16) − e − N c p ( s ) (cid:17)(cid:105) . (8)Further assuming N c p ( s ) (cid:28)
1, and p ( s ) = P gen ( s ) q − (where q accounts for selection [21] and P gen ( s ) is the probabilityof sequence generation [14]), the score simplifies to Eq. 1, with γ = − ln( qN c (cid:104) f (cid:105) ) = ln( q − (cid:104) f (cid:105) / (cid:104) f (cid:105) ). The factor γ depends on unknown parameters of the model, but can be estimated assuming a power-law for the clone sizedistribution [45], ρ ( f ) ∝ f − extending from f = 10 − to f = 0 .
01, and q = 0 .
01 [21], yielding γ ≈ . γ to minimize the AUROC, yielding γ ≈
15 (SI Fig. S4). Since performance degradesquickly for larger values, we conservatively set γ = 12. Estimating mean scores from RepSeq datasets
To estimate the autologous S and I of two samples of size M and M in the absence of true replicates, we computedtheir expected values from a single dataset containing N reads, from which two random subsamples of sizes M and M were taken. The mean value of S is equal to (cid:104)S(cid:105) = (cid:80) s (1 − (1 − f ( s )) M )(1 − (1 − f ( s )) M ), where f ( s ) is the true(and unknown) frequency of sequence s . A naive estimate of (cid:104) mS (cid:105) may be obtained by repeatly resampling subsetsof sizes M and M from the observed repertoire, calculate S for each draw, and average. One get the same result byreplacing f ( s ) by ˆ f s = n ( s ) /N in the previous formula, where n ( s ) is the number of s reads in the full dataset, and N = (cid:80) s n ( s ). However, this naive estimate leads to a systematic overestimate of the sharing (visible when comparedwith biological replicates, see Fig. S5), simply because this procedure overestimates the probability of resampling raresequences, in particular singletons whose true frequency may be much lower that 1 /N . A similar bias occurs whencomputing I . To correct for this bias, we look for a function h ( n ) that satisfies for all f : (cid:104) h ( n ) (cid:105) ≡ (cid:88) n (cid:18) Nn (cid:19) f n (1 − f ) N − n h ( n ) = (1 − (1 − f ) M ) (1 − (1 − f ) M ) , (9)so that (cid:104)S(cid:105) and (cid:104)I(cid:105) can be well approximated by: (cid:104)S(cid:105) ≈ (cid:88) s h ( n ( s )) , (10) (cid:104)I(cid:105) ≈ − (cid:88) s h ( n ( s )) [ln(1 /P gen ( s )) − γ ] . (11)Expanding the right-hand side of Eq. 9 into 4 terms, we find that h ( n ) = 1 − g M ( n ) − g M ( n ) + g M + M ( n ) satisfiesEq. 9 provided that: (cid:88) n (cid:18) Nn (cid:19) f n (1 − f ) N − n g M ( n ) = (1 − f ) M . (12)Under the change of variable x = f / (1 − f ), the expression becomes: (cid:88) n (cid:18) Nn (cid:19) x n g M ( n ) = (1 + x ) N − M = (cid:88) n (cid:18) N − Mn (cid:19) x n . (13)Identifying the polynomial coefficients in x n on both sides yields: g M ( n ) = (cid:18) N − Mn (cid:19)(cid:30)(cid:18) Nn (cid:19) . (14)These corrected estimates agree with the direct estimates using biological replicates (Fig. S5).Similarly, (cid:104)S(cid:105) and (cid:104)I(cid:105) in heterologous samples can be estimated using: (cid:104)S(cid:105) ≈ (cid:88) s [1 − g M ( n ( s ))][1 − g M ( n (cid:48) ( s ))] , (15) (cid:104)I(cid:105) ≈ (cid:88) s [1 − g M ( n ( s ))][1 − g M ( n (cid:48) ( s ))] [ln(1 /P gen ( s )) − γ ] . (16)where n ( s ) and n (cid:48) ( s ) are the empirical counts of sequence s in the two samples. Theoretical upper bound on heterologous scores
When the two samples were extracted from two different people (heterologous scenario), we can use the universalityof the recombination process to give upper bounds on the values of S and I . These bounds are represented by thedashed lines in Fig1c). If two samples of respectively M and M unique sequences are extracted from two differentindividuals, the number of shared sequences between them is given by [21]: (cid:104)S(cid:105) heterologous ≤ (cid:88) s (cid:16) − (1 − p ( s )) M (cid:17) (cid:16) − (1 − p ( s )) M (cid:17) (cid:47) M M (cid:88) s p ( s ) = M M (cid:104) p ( s ) (cid:105) . (17) p ( s ) is the probability of finding a sequence s in the blood. Following [21], we make the approximation p ( s ) = P gen ( s ) q − , where the q = 0 .
01 factor is the probability that a generated sequences passes selection. Then (cid:104) p ( s ) (cid:105) canbe estimated from the mean over generated sequences. Similarly, we can estimate I as (cid:104)I(cid:105) heterologous (cid:47) M M (cid:88) s p ( s ) [ln (1 /P gen ( s )) − γ ] = − M M (cid:104) p ( s )[ γ + ln( qp ( s ))] (cid:105) , (18)which is also estimated from the mean over generated sequences. Error rate estimates
To make the quantitative predictions shown in Fig. 1, we need to constrain the tail behavior of the distributions of S and I , for the two scenarios.The S statistic can be rewritten as a sum of Bernouilli variables over all possible sequences, each with a parametercorresponding to its probability of being present in both samples, either in the autologous or the heterologous case.Therefore S is a Poisson binomial distribution, a sum of independent Bernouilli variables with potentially differentparameters. The variance and tails of that distribution are bounded by those of the Poisson distribution with thesame mean, denoted by m a for the autologous case, and m h for the heterologous case (Fig. S6).Thanks to that inequality, the rates of false negatives and false positives for a given threshold r are bounded by: P ( S < r | autologous) ≤ Q ( r + 1 , m a ) , P ( S > r | heterologous) ≤ − Q ( r + 1 , m h ) , (19)where Q is the regularized gamma function, which appears in the cumulative distribution function of the Poissondistribution. The mean autologous score m a is estimated from experimental data: we use the smallest value of (cid:104)S(cid:105) inthe Emerson dataset and Eq. 10. To compute m h , we use the semi-theoretical prediction made using the universalityof the recombination process, Eq. 17.Similarly, I can be viewed as a sum of S independent random variables, all following the distribution of ln(1 /P gen ) − γ . However, this distribution differs in the two scenarios. Sequences shared between more than one donor have anhigher probability of being generated, their ln( P gen ) distribution has higher mean and smaller variance (Fig. S7).The sum is composed of a relatively large number of variables in most realistic scenarios. Hence, we rely on thecentral limit theorem to approximate it by a normal distribution, of mean and variance proportional to S . Explicitly: P ( I < r | autologous) = 12 ∞ (cid:88) S =0 ( m a ) S e − m a S ! (cid:32) (cid:32) r − S(cid:104) ln(1 /P gen ) − γ (cid:105) (cid:112) S Var[ln(1 /P gen ) − γ ] (cid:33)(cid:33) , (20) P ( I > r | heterologous) = 12 ∞ (cid:88) S =0 ( m h ) S e − m h S ! (cid:32) − erf (cid:32) r − S(cid:104) ln(1 /P gen ) − γ (cid:105) shared (cid:112) S Var[ln(1 /P gen ) − γ ] shared (cid:33)(cid:33) . (21)The AUROC are computed based on these estimates, by numerically integrating the true positive rate P ( S , I We use the model of Ref. [36] to describe the dynamics of individual T- or B-cell clone frequencies f under afluctuating growth rate reflecting the changing state of the environment and the random nature of immune stimuli: dfdt = (cid:20) − τ + 12 θ + 1 √ θ η ( t ) (cid:21) f ( t ) , (22)where η ( t ) is a Gaussian white noise with (cid:104) η ( t ) (cid:105) = 0 and (cid:104) η ( t ) η ( t (cid:48) ) (cid:105) = δ ( t − t (cid:48) ).With the change of variable x = ln( f ), these dynamics simplify to a simple Brownian motion in log-frequency: ∂ t x = − τ − + θ − / η ( t ). In that equation, τ appears as the decay rate of the frequency, while θ is the timescaleof the noise, interpreted as the typical time it takes for the frequency to rise or fall by a logarithmic unit owing tofluctuations. Considering a large population of clone, each with their independent frequency evolving according toEq. 22, and a source term at small f corresponding to thymic exports, one can show that the steady-state probabilitydensity function of f follows a power-law [36], ρ ( f ) ∝ f − α , with exponent α = 1 + 2 θ/τ . α was empirically found tobe ≈ θ ≈ τ . The turn-over time τ is unknown, andwas varied from 1 / f ( s, 0) = ˆ f ( s, 0) = n ( s, /N from the analysed datasets. A sample of size M was randomlyselected with respect to these frequencies, and the frequencies of the clones captured in that sample were then evolvedwith a time-step of 2 days using Euler-Maruyama’s method, which is exact in the case of Brownian motion. Cloneswith frequencies falling below 10 − (corresponding to a single cell in the organism) were removed. At each time t > S with the formula (cid:80) s (1 − (1 − f ( s, t )) M ) where the sum runs over the sequencescaptured in the initial sample. un i qu e c l ono t yp e s S h a nnon i nd e x mean of S (M=5000) S i m p s on i nd e x 1e 3 0 200 400 mean of S (M=5000) r ea d s FIG. S1: Comparison between the mean of S (autologous case), and three common diversity measures: the number of uniquesequences found in the dataset (top left), the Shannon index, − (cid:80) ˆ f s ln ˆ f s (top right), the Simpson index (bottom left), andthe total number of reads in each datasets (bottom right). All the diversity measures show a strong correlation with S , butthe correlation with the sequencing depth is low. − − − − False positive rate − − − − − F a l s e n e g a ti v e r a t e SI M = 20 M = 50 M = 100 FIG. S2: Detection Error Trade-off (DET) graph for both summary statistics, between a large sample (full dataset, M = 10 )and a smaller one, of size M = M . m ea n o f l n ( P g e n ) AllShared 0 200 400rank242220 m ea n o f l n ( P g e n ) SharedShared, CMV+ donorsShared, CMV donors FIG. S3: Left: Mean value of P gen as a function of the rank of the clonotype, for generic sequences (blue) and sequencesshared between more than two donors (orange). The mean stays flat indicating that the probability of being generated doesnot generally depend on the clonotype size. There is an exception (black rectangle), shown as a close-up on the right panel.The top twenty clones, when shared between donors, have a smaller probability of being generated than expected by chance.This difference is likely to be driven by convergent selection against common pathogens, since CMV positive donors show amore prononced effect than CMV negative ones. − − − − − − AU R O C ISI ( γ = 0) − − − − − − AU R O C SI ( γ = 0) I ( γ = 15) FIG. S4: Left panel: AUROC (Area Under Receiver Operating Characteristic) of I , as a function of γ ( M = M = M = 5000).We observe an optimum near γ = 15. Right panel: AUROC as a function of M , for S , I ( γ = 0), and I ( γ = 15). 75 100 125 150 175 200 225S between biological replicates100150200 S b e t w ee n nu m e r i ca l r e p li ca t e s Unbiased approachNaive approach FIG. S5: Naive and corrected estimates of the autologous S from single datasets, versus its values computed using true biologicalreplicates from Ref. [18]. Number of shared sequences P r opo r ti on Poisson distributionTrue distribution FIG. S6: Comparison between the distribution of S obtained by computationally and repeatedly downsampling a singlerepertoire from Ref. [17] with M = 5 , 000 (histogram), and a Poisson distribution of the same mean (full line). 120 100 80 60 40 20 log( p gen ) P r opo r ti on AllShared FIG. S7: Distribution of ln( P gengen