NoisET: Noise learning and Expansion detection of T-cell receptors with Python

Abstract

High-throughput sequencing of T- and B-cell receptors makes it possible to track immune repertoires across time, in different tissues, in acute and chronic diseases, and in healthy individuals. However, quantitative comparison between repertoires is confounded by variability in the read count of each receptor clonotype due to sampling, library preparation, and expression noise. We present NoisET, an easy-to-use Python package that implements and generalizes a previously developed Bayesian method. It can be used to learn experimental noise models for repertoire sequencing from replicates, and to detect responding clones following a stimulus. The package was tested on different repertoire sequencing technologies and datasets. Availability: NoisET is freely available to use with source code at github.com/statbiophys/NoisET.


Meriem Bensouda Koraichi, Maximilian Puelma Touzel, Thierry Mora,∗ and Aleksandra M. Walczak∗

Laboratoire de physique de l’École normale supérieure, CNRS, PSL University, Sorbonne Université, and Université de Paris, 75005 Paris, France

MILA, University of Montreal, Montreal, Canada

INTRODUCTION

High-throughput repertoire sequencing (RepSeq) of T- and B-cell receptors (TCR and BCR) [3] makes it possible to study the dynamics of lymphocytes at the resolution of single clones, by comparing their concentrations across timepoints or conditions. To detect biologically relevant clones, one must be able to distinguish true differences in clone frequencies from experimental noise. This variability has two sources. First, laboratories use various sequencing and sample preparation protocols based on either gDNA or cDNA (with or without unique molecular identifiers), with different outcomes in terms of amplification bias and errors [2, 4]. This makes it difficult to reliably estimate TCR or BCR clonal frequencies from sequence counts. Second, one must extrapolate the immune information contained in a few milliliters of blood to the whole repertoire. To describe these sources of variability, one needs a probabilistic approach.

Touzel et al. [9] developed a statistical model to identify responding clones using sequence counts in longitudinal RepSeq data. This model captures features of a repertoire response to a single, strong perturbation (e.g. yellow fever vaccination) that gives rise to fast, transient response dynamics. The method was proposed as an alternative to commonly used tests such as Fisher’s exact test [1] or beta-binomial models [8]. Its main innovation is to account for the different sources of biological and experimental noise in the clone count measurements in a Bayesian way, allowing for a more reliable detection of expanded or contracted clones. This note introduces NoisET (Noise sampling learning and Expansion detection of TCRs), an easy-to-use Python package that implements this method and extends it to datasets of diverse origin describing the clonal repertoire response to acute infections.
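To give a feel for what a Bayesian treatment of count noise looks like, here is a deliberately simplified sketch. It is not NoisET's implementation and does not call the package. It assumes pure Poisson sampling noise, a flat prior over the log-fold change on a grid, and natural logarithms. The function names, toy counts, and the two cut-off values are hypothetical. Under Poisson sampling, conditioning on the total count of a clone makes its second-timepoint count binomial, so the unknown clone frequency drops out.

```python
import math

def log_binom_pmf(k, n, q):
    """Log-probability of k successes in n trials with success probability q."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(q) + (n - k) * math.log1p(-q))

def posterior_log_fold(n1, n2, reads1, reads2, s_grid):
    """Posterior over the log-fold change s on a grid (flat prior, Poisson noise).

    Given n1 + n2, the count n2 is binomial with success probability
    reads2 * e^s / (reads1 + reads2 * e^s); the clone frequency cancels.
    """
    log_w = []
    for s in s_grid:
        q = reads2 * math.exp(s) / (reads1 + reads2 * math.exp(s))
        log_w.append(log_binom_pmf(n2, n1 + n2, q))
    top = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in log_w]
    total = sum(weights)
    return [w / total for w in weights]

def summarise(s_grid, post):
    """Return (posterior probability that s > 0, posterior median of s)."""
    p_expanded = sum(p for s, p in zip(s_grid, post) if s > 0)
    cumulative = 0.0
    for s, p in zip(s_grid, post):
        cumulative += p
        if cumulative >= 0.5:
            return p_expanded, s
    return p_expanded, s_grid[-1]

def call_expanded(counts, reads1, reads2, prob_threshold, lfc_threshold):
    """Label a clone as expanded when both summaries pass their cut-offs."""
    s_grid = [(i - 120) / 20 for i in range(241)]  # s from -6 to 6 in steps of 0.05
    called = []
    for clone_id, (n1, n2) in counts.items():
        post = posterior_log_fold(n1, n2, reads1, reads2, s_grid)
        p_expanded, median_s = summarise(s_grid, post)
        if p_expanded >= prob_threshold and median_s >= lfc_threshold:
            called.append(clone_id)
    return called

# Toy data: clone -> (count before, count after); values are made up.
toy_counts = {"clone_A": (2, 40), "clone_B": (30, 33)}
responding = call_expanded(toy_counts, reads1=100_000, reads2=100_000,
                           prob_threshold=0.95, lfc_threshold=1.0)
print(responding)
```

With these toy numbers, the clone whose count rises from 2 to 40 passes both cut-offs, while the clone going from 30 to 33 does not, because its change is compatible with sampling noise alone. A flat prior is a crude choice that overstates the evidence for clones with very few reads, which is one reason a learnt noise model and an informed prior matter in practice.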

FEATURES

NoisET has two main functions: (1) inference of a statistical null model of sequence counts and variability, using replicate RepSeq experiments; (2) detection of clones responding to a stimulus by comparison of two repertoires taken at two timepoints. The second function requires a noise model, which is given as an output of the first function. Both functions require two lists of sequence counts associated with each TCR or BCR present in the repertoires: from replicate experiments for the first function (Fig. 1a), and from repertoires before and after the stimulus for the second function (Fig. 1b).

When using the first function, the user must pick the type of noise model, which describes how the sequence count in the RepSeq sample depends probabilistically on its true frequency in the blood. Choices are: a Poisson distribution, a negative binomial distribution, or a two-step model [9]. Once the parameters have been learnt by maximum-likelihood estimation, a generation tool can be applied to qualitatively check the agreement between data and model for replicates (Fig. 1a). We also successfully learnt a null model from gDNA data [8], which is included in the package example notebook.

To use the second function to detect responding clonotypes, the user provides, in addition to the two datasets to be compared, two sets of experimental noise parameters learnt at both times using the first function. When replicates are not available for each timepoint or donor, a common null model may be used for both timepoints. This should be done with caution, since even if both samples are produced with the same technology for the same donor, the sequencing depth and distribution of clone frequencies may vary between timepoints. Finally, the user provides two thresholds: one for the posterior probability above which a clone is labeled as responding, and one for the median log-fold frequency difference above which detection is allowed. The output is a CSV file containing a table of putative responding clones. The result is illustrated in Fig. 1b, which shows contracted clones (purple points) detected from day 15 to day 85 after a mild COVID-19 infection [6]. Fig. 1c reports the number of responding clonotypes detected by NoisET applied to three different datasets revealing COVID-19 and yellow fever vaccine TCR response dynamics.

All functions are explained in a well-documented README and notebooks displayed on the GitHub repository.

FIG. 1: (a) Scatter plots of sequence counts from two biological replicates from Pogorelyy et al. [7] (left). NoisET learns a statistical model of sequence frequencies and observed counts from these data (with negative binomial sampling noise model), which can then be used to generate realistic synthetic data (right). (b) Scatter plot of contracted clones from day 15 to day 85 after a mild COVID-19 infection [6]. Clones detected as contracting by NoisET are shown in purple. (c) Number of responding clones detected by NoisET (using a two-step noise model) for 3 studies: donors M and W (with both α and β TCR chains) in response to COVID-19 between days 15 and 85 [6]; 6 twin donors (S1 through Q2, only β chain) between days 0 and 15 following yellow-fever vaccination [7]; and yellow-fever first (M) and second vaccination (M and P) [5].

DISCUSSION

NoisET is designed as an easy-to-use package to learn the noisy statistics of sequence counts and to detect clones responding to a stimulus as reliably as possible. It captures the experimental and biological noise for both RNAseq and gDNAseq replicate technologies. Although the package has been tested on diverse datasets, choosing and using an appropriate statistical null model should be done with caution. Among the different types of noise model offered, the negative binomial noise model is recommended to start the analysis, as its running time is shorter than that of the two-step model while it retains the ability to account for arbitrary noise amplitudes. So far, NoisET has been used to study the short-time-scale dynamics of acute infections, but it could also be used to compare bulk repertoires with selected repertoires derived from functional or cultured assays [1]. For longer time scales, the dynamics of lymphocyte populations should be modeled to best describe slow global repertoire changes that cannot be attributed to a single stimulus.
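The remark about noise amplitudes can be made concrete with a small sketch comparing a Poisson and a negative binomial count distribution for a clone with the same expected read count. It uses only the Python standard library and does not call NoisET. The mean-variance relation `var = mean + a * mean**gamma`, the parameter values, and the function names are assumptions chosen for illustration, not values learnt by the package.

```python
import math

def poisson_pmf(n, mean):
    """Probability of observing n reads when the expected count is `mean`."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def negbin_pmf(n, mean, a, gamma):
    """Negative binomial pmf with variance = mean + a * mean**gamma (assumed form)."""
    var = mean + a * mean ** gamma
    p = mean / var                   # success probability
    r = mean ** 2 / (var - mean)     # dispersion (number of successes)
    log_pmf = (math.lgamma(n + r) - math.lgamma(r) - math.lgamma(n + 1)
               + r * math.log(p) + n * math.log1p(-p))
    return math.exp(log_pmf)

def moments(pmf, n_max=2000):
    """Mean and variance of a count distribution, summed up to n_max."""
    mean = sum(n * pmf(n) for n in range(n_max))
    var = sum((n - mean) ** 2 * pmf(n) for n in range(n_max))
    return mean, var

expected = 20.0          # expected read count of the clone (toy value)
a, gamma = 0.5, 1.5      # hypothetical noise amplitude and exponent

pois_mean, pois_var = moments(lambda n: poisson_pmf(n, expected))
nb_mean, nb_var = moments(lambda n: negbin_pmf(n, expected, a, gamma))

print(f"Poisson:           mean={pois_mean:.2f}, variance={pois_var:.2f}")
print(f"Negative binomial: mean={nb_mean:.2f}, variance={nb_var:.2f}")
```

The Poisson variance is tied to the mean, whereas the two extra parameters of the negative binomial let the variance exceed the mean by an adjustable amount, which is what allows such a model to absorb noise beyond pure sampling.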

Acknowledgements

This work was partially supported by the European Research Council Consolidator Grant n. 724208 and ANR-19-CE45-0018 “RESP-REP” from the Agence Nationale de la Recherche.

∗ Corresponding authors. These authors contributed equally.

[1] Balachandran et al. (2017) Identification of unique neoantigen qualities in long-term survivors of pancreatic cancer. Nature, (7681), 512–516.
[2] Barennes et al. (2020) Benchmarking of T cell receptor repertoire profiling methods reveals large systematic biases. Nature Biotechnology, online ahead of print.
[3] Benichou et al. (2012) Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. Immunology, (3), 183–191.
[4] Heather et al. (2017) High-throughput sequencing of the T-cell receptor repertoire: pitfalls and opportunities. Briefings in Bioinformatics, (4), 554–565.
[5] Minervina et al. (2020) Primary and secondary anti-viral response captured by the dynamics and phenotype of individual T-cell clones. eLife, e53704.
[6] Minervina et al. (2021) Longitudinal high-throughput TCR repertoire profiling reveals the dynamics of T-cell memory formation after mild COVID-19 infection. eLife, e63502.
[7] Pogorelyy et al. (2018) Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins. Proceedings of the National Academy of Sciences, (50), 12704–12709.
[8] Rytlewski et al. (2019) Model to improve specificity for identification of clinically-relevant expanded T cells in peripheral blood. PLOS ONE, (3), e0213684.
[9] Touzel et al. (2020) Inferring the immune response from repertoire sequencing. PLOS Computational Biology, 16