[PDF] Data-driven design of targeted gene panels for estimating immunotherapy biomarkers

Abstract

We introduce a novel data-driven framework for the design of targeted gene panels for estimating exome-wide biomarkers in cancer immunotherapy. Our first goal is to develop a generative model for the profile of mutation across the exome, which allows for gene- and variant type-dependent mutation rates. Based on this model, we then propose a new procedure for estimating biomarkers such as Tumour Mutation Burden and Tumour Indel Burden. Our approach allows the practitioner to select a targeted gene panel of a prespecified size, and then construct an estimator that only depends on the selected genes. Alternatively, the practitioner may apply our method to make predictions based on an existing gene panel, or to augment a gene panel to a given size. We demonstrate the excellent performance of our proposal using an annotated mutation dataset from 1144 Non-Small Cell Lung Cancer patients.

Full PDF

DData-driven design of targeted gene panels forestimating immunotherapy biomarkers

Jacob R. Bradley and Timothy I. Cannings

School of Mathematics, University of Edinburgh

Abstract

We introduce a novel data-driven framework for the design of targeted gene panelsfor estimating exome-wide biomarkers in cancer immunotherapy. Our ﬁrst goal is todevelop a generative model for the proﬁle of mutation across the exome, which allowsfor gene- and variant type-dependent mutation rates. Based on this model, we thenpropose a new procedure for estimating biomarkers such as Tumour Mutation Burdenand Tumour Indel Burden. Our approach allows the practitioner to select a targetedgene panel of a prespeciﬁed size, and then construct an estimator that only dependson the selected genes. Alternatively, the practitioner may apply our method to makepredictions based on an existing gene panel, or to augment a gene panel to a givensize. We demonstrate the excellent performance of our proposal using an annotatedmutation dataset from 1144 Non-Small Cell Lung Cancer patients.

Keywords: cancer, gene panel design, targeted sequencing, tumour indelburden, tumour mutation burden.

It has been understood for a long time that cancer, a disease occurring in many distincttissues of the body and giving rise to a wide range of presentations, is initiated and drivenby the accumulation of mutations in a subset of a person’s cells (Boveri, 2008). Since thediscovery of Immune Checkpoint Blockade (ICB) (Ishida et al., 1992; Leach et al., 1996),there has been an explosion of interest in cancer therapies targeting immune response andICB therapy is now widely used in clinical practice (Robert, 2020). ICB therapy works bytargeting natural mechanisms (or checkpoints ) that disengage the immune system, for exam-ple the proteins Cytotoxic T Lymphocyte Associated protein 4 (CTLA-4) and ProgrammedDeath Ligand 1 (PD-L1) (Buchbinder and Desai, 2016). Inhibition of these checkpoints canpromote a more aggressive anti-tumour immune response (Pardoll, 2012), and in some pa-tients this leads to long-term remission (Gettinger et al., 2019). However, ICB therapy isnot always eﬀective (Nowicki et al., 2018) and may have adverse side-eﬀects, so determiningwhich patients will beneﬁt in advance of treatment is vital.Exome-wide prognostic biomarkers for immunotherapy are now well-established – in par-ticular, Tumour Mutation Burden (TMB) is used to predict response to immunotherapy For their work on ICB, James Allison and Tasuku Honjo received the 2018 Nobel Prize for Physiol-ogy/Medicine (Ledford et al., 2018). a r X i v : . [ q - b i o . GN ] F e b Zhu et al., 2019; Cao et al., 2019). TMB is deﬁned as the total number of non-synonymousmutations occurring throughout the tumour exome, and can be thought of as a proxy forhow easily a tumour cell can be recognised as foreign by immune cells (Chan et al., 2019).However, the cost of measuring TMB using Whole Exome Sequencing (WES) (Sboner et al.,2011) currently prohibits its widespread use as standard-of-care. Sequencing costs, bothﬁnancial and in terms of the time taken for results to be returned, are especially problematicin situations where high-depth sequencing is required, such as when utilising blood-basedCirculating Tumour DNA (ctDNA) from liquid biopsy samples (Gandara et al., 2018). Thesame issues are encountered when measuring more recently proposed biomarkers such asTumour Indel Burden (TIB) (Wu et al., 2019b; Turajlic et al., 2017), which counts thenumber of frameshift insertion and deletion mutations. There is, therefore, demand for cost-eﬀective approaches to estimate these biomarkers (Fancello et al., 2019; Golkaram et al.,2020).In this paper we propose a novel, data-driven method for biomarker estimation, based ona generative model of how mutations arise in the tumour exome. More precisely, we modelmutation counts as independent Poisson variables, where the mean number of mutationsdepends on the gene of origin and variant type, as well as the Background Mutation Rate(BMR) of the tumour. Due to the ultrahigh-dimensional nature of sequencing data and thefact that in many genes mutations arise purely according to the BMR, we use a regularisationpenalty when estimating the parameters of the model. In addition, this identiﬁes a subsetof genes that are mutated above or below the background rate. Our model facilitates theconstruction of a new estimator of TMB, based on a weighted linear combination of thenumber of mutations in each gene. The vector of weights is chosen to be sparse (i.e. havemany entries equal to zero), so that our estimator of TMB may be calculated using onlythe mutation counts in a subset of genes. In particular, this allows for accurate estimationof TMB from a targeted gene panel, where the panel size (and therefore the cost) maybe determined by the user. We demonstrate the excellent practical performance of ourframework using a Non-Small Cell Lung Cancer (NSCLC) dataset (Chalmers et al., 2017),and include a comparison with existing state-of-the-art approaches for estimating TMB.Moreover, since our model allows variant type-dependent mutation rates, it can be adaptedeasily to predict other biomarkers, such as TIB. Finally, our method may also be used incombination with an existing targeted gene panel. In particular, we can estimate a biomarkerdirectly from the panel, or ﬁrst augment the panel and then construct an estimator.Due to its emergence as a biomarker for immunotherapy in recent years, a variety ofgroups have considered methods for estimating TMB. A simple and common way to esti-mate TMB is via the proportion of mutated codons in a targeted region. Budczies et al.(2019) investigate how the accuracy of predictions made in this way are aﬀected by the sizeof the targeted region, where mutations are assumed to occur at uniform rate throughout thegenome. More recently Yao et al. (2020) modelled mutations as following a negative binomialdistribution while allowing for gene-dependent rates, which are inferred by comparing non-synonymous and synonymous mutation counts. In contrast, our method does not requiredata including synonymous mutations. Where they are included, we do not assume thatsynonymous mutations occur at a uniform rate throughout the genome, giving us the ﬂexi-bility to account for location-speciﬁc eﬀects on synonymous mutation rate such as chromatinconﬁguration (Makova and Hardison, 2015) and transcription-dependent repair mechanisms(Fong et al., 2013). Linear regression models have been used for both panel selection (Lyu2t al., 2018) and for biomarker prediction (Guo et al., 2020). A review of some of the issuesarising when dealing with targeted panel-based predictions of TMB biomarkers is given byWu et al. (2019a). Finally, we are unaware of any methods for estimating TIB from targetedgene panels.The remainder of the paper is as follows. In Section 2, we introduce our data sources,and provide a detailed description of our methodological proposal. Experimental resultsare given in Section 3 and we conclude in Section 4. Finally, we also provide an R package ICBioMark (Bradley and Cannings, 2021) which implements the methodology and reproducesthe experimental results in the paper.

Our methodology can be applied to any annotated mutation dataset obtained by WES. Todemonstrate our proposal we make use of the NSCLC dataset produced by Campbell et al.(2016), which contains data from 1144 patient-derived tumours. For each sample in thisdataset we have the genomic locations and variant types of all mutations identiﬁed. At thetime of the study, the patients had a variety of prognoses and smoking histories, were agedbetween 39 and 90, 41% were female and 59% were male; see Figure 1. In Figure 2A wesee that mutations counts are distributed over a very wide range, as is the case in manycancer types (Chalmers et al., 2017). For simplicity, we only consider seven nonsynonymousvariant types: missense mutations (which are the most abundant), nonsense mutations,frameshift insertions/deletions, splice site mutations, in-frame insertions/deletions, nonstopmutations and translation start site mutations. We present the frequencies of these mutationtypes in Figure 2B. Frameshift insertion/deletion (also known as indel) mutations are ofparticular interest when predicting TIB, but contribute only a small proportion ( < A : Violin plotsof age for patients, stratiﬁed by sex. B : Stacked bar chart of patients’ smoking histories,shaded according to cancer stage diagnosis.It is useful at this point to introduce the notation used throughout the paper. The set G denotes the collection of genes that make up the exome. For a gene g ∈ G , let (cid:96) g be the3igure 2: Dataset-wide distribution of mutations. A : Violin plot of the distribution of TMBand TIB across training samples. B : The relative frequency of diﬀerent nonsynonymousmutation types.length of g in nucleotide bases, deﬁned by the maximum coding sequence . A gene panelis a subset P ⊆ G , and we write (cid:96) P := (cid:80) g ∈ P (cid:96) g for its total length. We let S denote theset of variant types in our data (e.g. in the dataset mentioned above, S contains the sevenpossible non-synonymous variants). Now, for i = 0 , , . . . , n , let M igs denote the count ofmutations in gene g ∈ G of type s ∈ S in the i th sample. Here the index i = 0 is usedto refer to an unseen test sample for which we would like to make a prediction, while theindices i = 1 , . . . , n enumerate the samples in our training data set. In order to deﬁne theexome-wide biomarker of particular interest, we specify a subset of mutation types ¯ S ⊆ S ,and let T i ¯ S := (cid:88) g ∈ G (cid:88) s ∈ ¯ S M igs , (1)for i = 0 , . . . , n . For example, including all non-synonymous mutation types in ¯ S speciﬁes T i ¯ S as the TMB of sample i , whereas letting ¯ S contain only indel mutations gives TIB.Our main goal is to predict T S based on { M gs : g ∈ P, s ∈ S } , where the panel P ⊆ G has length (cid:96) P satisfying some upper bound. When it is clear from context that we arereferring to the test sample and a speciﬁc choice of biomarker (i.e. ¯ S is ﬁxed), we will simplywrite T in place of T S . We now describe the main statistical model that underpins our methodology. In orderto account for selective pressures and other factors within the tumour, we allow the rate atwhich mutations occur to depend on the gene and type of mutation. Our model also includesa sample-dependent parameter to account for the diﬀering levels of mutagenic exposure oftumours, which may occur due to exogenous (e.g. UV light, cigarette smoke) or endogenous(e.g. inﬂammatory, free radical) factors. The maximum coding sequence is deﬁned as the collection of codons that may be translated for someversion of a gene, even if all the codons comprising the maximum coding sequence are never simultaneouslytranslated. Gene coding lengths are extracted from the

Ensembl database (Yates et al., 2020).

4e model the mutation counts M igs as independent Poisson random variables with mu-tation rates φ igs >

0. More precisely, for i = 0 , , . . . , n , g ∈ G and s ∈ S , we have M igs ∼ Poisson( φ igs ) , (2)where M igs and M i (cid:48) g (cid:48) s (cid:48) are independent for ( i, g, s ) (cid:54) = ( i (cid:48) , g (cid:48) , s (cid:48) ). Further, to model thedependence of the mutation rate on the sample, gene and mutation type, we use a log-linkfunction and let log( φ igs ) = µ i + log( (cid:96) g ) + λ g + ν s + η gs , (3)for µ i , λ g , ν s , η gs ∈ R , where for identiﬁability we set η gs = 0, for some s ∈ S and all g ∈ G .The terms in our model can be interpreted as follows. First, the parameter µ i correspondsto the BMR of the i th sample. The oﬀset log( (cid:96) g ) accounts for a mutation rate that isproportional to the length of a gene, so that a non-zero value of λ g corresponds to increasedor decreased mutation rate relative to the BMR. The parameters ν s and η gs account fordiﬀerences in frequency between mutation types for each gene.The model in (2) and (3) (discounting the unseen test sample i = 0) has n + | S | + | G || S | free parameters and we have n | G || S | independent observations in the training data set. Inprinciple we could attempt to ﬁt our model directly using maximum likelihood estimation.However, we wish to exploit the fact that most genes do not play an active role in thedevelopment of a tumour, and will be mutated approximately according to the BMR. Thiscorresponds to the parameters λ g and η gs being zero for many g ∈ G . We therefore includean (cid:96) -penalisation term applied to the parameters λ g and η gs when ﬁtting our model. Wedo not penalise the parameters ν s or µ i .Writing µ := ( µ , . . . , µ n ), λ := ( λ g : g ∈ G ), ν := ( ν s : s ∈ S ) and η := ( η gs : g ∈ G, s ∈ S ), and given training observations M igs = m igs , we let L ( µ, λ, ν, η ) = n (cid:88) i =1 (cid:88) g ∈ G (cid:88) s ∈ S (cid:16) φ igs − m igs log φ igs (cid:17) be the negative log-likelihood of the model speciﬁed by (2) and (3). We then deﬁne(ˆ µ, ˆ λ, ˆ ν, ˆ η ) = arg min µ,λ,ν,η (cid:110) L ( µ, λ, ν, η ) + κ (cid:16)(cid:88) g ∈ G | λ g | + (cid:88) g ∈ G (cid:88) s ∈ S | η gs | (cid:17)(cid:111) , (4)where κ ≥ λ and ˆ η , which we choose using cross-validation (see Section 2.5 for more detail). We now attend to our main goal of estimating a given exome-wide biomarker for the unseentest sample. Fix ¯ S ⊆ S and recall that we write T = T S . We wish to construct an estimatorof T that only depends on the mutation counts in a gene panel P ⊂ G , subject to a constrainton (cid:96) P . To that end, we consider estimators of the form T ( w ) := (cid:88) g ∈ G (cid:88) s ∈ S w gs M gs , Note that our estimator may use the the full set S of variant types, rather than just those in ¯ S . In otherwords, our estimator may utilise information from every mutation type, not just those that directly constitutethe biomarker of interest. This is important when estimating mutation types in ¯ S that are relatively scarce(e.g. for TIB). w ∈ R | G |×| S | . In the remainder of this subsection we explain how the weights w arechosen to minimise the expected squared error of T ( w ) based on the generative model inSection 2.2.Of course, setting w gs = 1 for g ∈ G and s ∈ ¯ S (and w gs = 0 otherwise) will give T ( w ) = T . However, our aim is to make predictions based on a concise gene panel. If,for a given g ∈ G , we have w gs = 0 for all s ∈ S , then T ( w ) does not depend on themutations in g and therefore the gene does not need to be included in the panel. In order toproduce a suitable gene panel (i.e. with many w gs = 0), we penalise non-zero components of w when minimising the expected squared error. We deﬁne our ﬁnal estimator via a reﬁttingprocedure, which improves the predictive performance by reducing the bias, and is alsohelpful when applying our procedure to panels with predetermined genes.To construct our estimator, note that under our model in (2) we have E M gs = Var( M gs ) = φ gs , and it follows that the expected squared error of T ( w ) is E (cid:2) { T ( w ) − T } (cid:3) = Var( T ( w )) + Var( T ) − T ( w ) , T ) + (cid:2) E { T ( w ) − T } (cid:3) = (cid:88) g ∈ G (cid:88) s ∈ ¯ S (1 − w gs ) φ gs + (cid:88) g ∈ G (cid:88) s ∈ S \ ¯ S w gs φ gs + (cid:16)(cid:88) g ∈ G (cid:88) s ∈ S w gs φ gs − (cid:88) g ∈ G (cid:88) s ∈ ¯ S φ gs (cid:17) . (5)This depends on the unknown parameters µ , λ g , ν s and η gs , the latter three of which arereplaced by their estimates given in (4). It is also helpful to then rescale (5) as follows: writeˆ φ gs = (cid:96) g exp(ˆ λ g + ˆ ν s + ˆ η gs ), and deﬁne p gs := ˆ φ gs (cid:80) g (cid:48) ∈ G (cid:80) s (cid:48) ∈ ¯ S ˆ φ g (cid:48) s (cid:48) = (cid:96) g exp(ˆ λ g + ˆ ν s + ˆ η gs ) (cid:80) g (cid:48) ∈ G (cid:80) s (cid:48) ∈ ¯ S (cid:96) g (cid:48) exp(ˆ λ g (cid:48) + ˆ ν s (cid:48) + ˆ η g (cid:48) s (cid:48) ) . Then let f ( w ) := (cid:88) g ∈ G (cid:88) s ∈ ¯ S p gs (1 − w gs ) + (cid:88) g ∈ G (cid:88) s ∈ S \ ¯ S p gs w gs + K ( µ ) (cid:0) − (cid:88) g ∈ G (cid:88) s ∈ S p gs w gs (cid:1) , where K ( µ ) = exp( µ ) (cid:80) g ∈ G (cid:80) s ∈ ¯ S (cid:96) g exp(ˆ λ g + ˆ ν s + ˆ η gs ). Since f is a rescaled version ofthe error in (5) (with the true parameters λ, ν, η replaced by the estimates ˆ λ, ˆ ν, ˆ η ), we willchoose w to minimise f ( w ).Note that f only depends on µ via the K ( µ ) term, which can be interpreted as a penaltyfactor controlling the bias of our estimator. For example, we may insist that the squaredbias term (1 − (cid:80) g ∈ G (cid:80) s ∈ S p gs w gs ) is zero by setting K ( µ ) = ∞ . In practice, we proposeto choose the penalty K based on the training data; see Section 2.5.At this point f ( w ) is minimised by choosing w to be such that w gs = 1 for all g ∈ G, s ∈ ¯ S ,and w gs = 0 otherwise. As mentioned above, in order to form a concise panel while optimisingpredictive performance, we impose a constraint on the cost of sequencing the genes used inthe estimation. More precisely, for a given w , an appropriate cost is (cid:107) w (cid:107) G, := (cid:88) g ∈ G (cid:96) g { w gs (cid:54) = 0 for some s ∈ S } . L , our goal is to minimise f ( w ) such that (cid:107) w (cid:107) G, ≤ L .In practice this problem is non-convex and so computationally infeasible. As is commonin high-dimensional optimisation problems, we consider a convex relaxation as follows: let (cid:107) w (cid:107) G, := (cid:80) g ∈ G (cid:96) g (cid:107) w g (cid:107) , where w g = ( w gs : s ∈ S ) ∈ R | S | , for g ∈ G , and (cid:107) · (cid:107) is theEuclidean norm. Deﬁne ˆ w ﬁrst-ﬁt ∈ arg min w (cid:8) f ( w ) + κ (cid:107) w (cid:107) G, (cid:9) , (6)where κ ≥ P ⊆ G ,let W P := { w ∈ R | G |×| S | : w g = (0 , . . . ,

0) for g ∈ G \ P } . (7)Let ˆ P := { g ∈ G : (cid:107) ˆ w ﬁrst-ﬁt g (cid:107) > } be the panel selected by the ﬁrst-ﬁt estimator in (6),and deﬁne ˆ w reﬁt ∈ arg min w ∈ W ˆ P (cid:8) f ( w ) (cid:9) . (8)We then estimate T using ˆ T := T ( ˆ w reﬁt ), which only depends on mutations in genes containedin the selected panel ˆ P . The performance of our estimator is investigated in Section 3, forcomparison we also include the performance of the ﬁrst-ﬁt estimator T ( ˆ w ﬁrst − ﬁt ). In practice, when designing gene panels a variety of factors contribute to the choice of genesincluded. For example, a gene may be included due to its relevance to immune response or itsknown association with a particular cancer type. If this is the case, measurements for thesegenes will be made regardless of their utility for predicting exome-wide biomarkers. Whenimplementing our methodology, therefore, there is no additional cost to incorporate observa-tions from these genes into our prediction if they will be helpful. Conversely researchers maywish to exclude genes from a panel, or at least from actively contributing to the estimationof a biomarker, for instance due to technical diﬃculties in sequencing a particular gene.We can accommodate these restrictions by altering the structure of our regularisationpenalty in (6). Suppose we are given (disjoint sets of genes) P , Q ⊆ G to be included andexcluded from our panel, respectively. In this case, we replace ˆ w ﬁrst-ﬁt in (6) withˆ w ﬁrst-ﬁt P ,Q ∈ arg min w ∈ W G \ Q (cid:8) f ( w ) + κ (cid:88) g ∈ G \ P l g (cid:107) w g (cid:107) (cid:9) . (9)Excluding the elements of P from the penalty term means that ˆ w ﬁrst-ﬁt P ,Q (cid:54) = 0 for the genesin P , while restricting our optimisation to W G \ Q excludes the genes in Q by deﬁnition.This has the eﬀect of augmenting the predetermined panel P with additional genes selectedto improve predictive performance. We then perform reﬁtting as described above. Wedemonstrate this procedure by augmenting the TST-170 gene panel in Section 3.4.7 .5 Practical considerations In this section, we discuss some practical aspects of our proposal. Our ﬁrst considerationconcerns the choice of the tuning parameter κ in (4). As is common for the Least AbsoluteShrinkage and Selection Operator (LASSO) estimator in generalised linear regression (see,for example, Michoel (2016) and Friedman et al. (2020)), we will use 10-fold cross-validation.To highlight one important aspect of our cross-validation procedure, recall that we considerthe observations M igs as independent across the sample index i ∈ { , . . . , n } , the gene g ∈ G and the mutation type s ∈ S . Our approach therefore involves splitting the entire set { ( i, g, s ) : i = 1 , . . . , n, g ∈ G, s ∈ S } of size n | G || S | (as opposed to the sample set { , . . . , n } )into 10 folds uniformly at random. We then apply the estimation method in (4) to each ofthe 10 folds separately on a grid of values (on the log scale) of κ , and select the value thatresults in the smallest average deviance across the folds. The model is then reﬁtted usingall the data for this value of κ .The estimated coeﬃcients in (6) depend on the choice of K ( µ ) and κ . As mentionedabove, we could set K ( µ ) = ∞ to give an unbiased estimator, however in practice we foundthat a ﬁnite choice of K ( µ ) leads to improved predictive performance. Our recommendationis to use K ( µ ) = K (max i =1 ,...,n { ˆ µ i } ), where ˆ µ i = log( T i / (cid:80) g,s (cid:96) g exp(ˆ λ g + ˆ ν s + ˆ η gs )) is apseudo-MLE (in the sense of Gong and Samaniego (1981)) for µ i , so that the penalisationis broadly in proportion with the largest values of µ i in the training dataset. The tuningparameter κ controls the size of the gene panel selected in (6): given a panel length L , weset κ ( L ) = max { κ : (cid:96) ˆ P ≤ L } in order to produce a suitable panel.We now comment brieﬂy on some computational aspects of our method. The generativemodel ﬁt in (4) can be solved via coordinate descent (see, for example, Friedman et al., 2010),which has a computational complexity of O ( N | G | | S | ) per iteration. We ﬁt the model 10times, one for each fold in our cross-validation procedure. This is the most computationallydemanding part of our proposal – in our experiments below, it takes approximately an hourto solve on a standard laptop – but it only needs to be carried out once for a given dataset.The convex optimisation problem in (6) can be solved by any method designed for the groupLASSO; see, for example, Yang and Zou (2015). In our experiments in Section 3, we use the gglasso R package (Yang et al., 2020), which takes around 10 minutes to reproduce the plotin Figure 6. Note also that the solutions to (6) and (8) are unique; see, for example, Rothand Fischer (2008, Theorem 1). The last step of our proposal, namely making predictionsfor new test observations based on a selected panel, carries negligible computational cost.Finally we describe a heuristic procedure for producing prediction intervals around ourpoint estimates. In particular, for a given conﬁdence level α ∈ (0 , T L , ˆ T U ] such that P (cid:0) ˆ T L ≤ T ≤ ˆ T U (cid:1) ≥ − α. To that end, let t α := E { ( ˆ T − T ) } /α ,then by Markov’s inequality we have that P ( | ˆ T − T | ≥ t α ) ≤ α . It follows that [ ˆ T − t / α , ˆ T + t / α ] is a (1 − α )-prediction interval for T . Of course, the mean squared error E { ( ˆ T − T ) } deﬁned in (5) depends on the parameters λ, η, ν and µ , which are unknown. Our approach isto utilise the estimates ˆ λ, ˆ η, ˆ ν (see (4)) and replace µ with log( ˆ T / (cid:80) g,s (cid:96) g exp(ˆ λ g + ˆ ν s + ˆ η gs )).While this is not an exact (1 − α )-prediction interval for T , we will see in our experimentalresults in Sections 3.2 and 3.3 that in practice this approach provides intervals with validempirical coverage. 8 Experimental results

In this section we demonstrate the practical performance of our proposal using the datasetfrom Campbell et al. (2016), which we introduced in Section 2.1. Our main focus is the pre-diction of TMB, and we show that our approach outperforms the state-of-the-art approaches.We also analyse the suitability of our generative model, consider the task of predicting therecently proposed biomarker TIB, and include a panel augmentation case study with theFoundation One gene panel.Since we are only looking to produce estimators for TMB and TIB, we group mutationsinto two categories – indel mutations and all other non-synonymous mutations – so that | S | = 2. This simpliﬁes the presentation of our results and reduces the computational costof ﬁtting the generative model. In order to assess the performance of each of the methodsin this section, we randomly split the dataset into training, validation and test sets, whichcontain n train = n = 800 , n val = 171 and n test = 173 samples, respectively. Mutations areobserved in | G | = 17358 genes. Our training set comprises samples with an average TMB of252 and TIB of 9 . The ﬁrst step in our analysis is to ﬁt the model proposed in Section 2.2 using only the trainingdataset. In particular, we obtain estimates of the model parameters using equation (4),where the tuning parameter κ is determined using 10-fold cross-validation as described inSection 2.5. The results are presented in Figure 3. The best choice of κ produces estimates of λ and η with 44 .

4% and 77 .

8% sparsity respectively, i.e. that proportion of their componentsare estimated to be exactly zero. We plot ˆ λ and ˆ η for this value of κ in Figures 4 and 5.Genes with ˆ λ g = 0 are interpreted to be mutating according to the background mutationrate, and genes with ˆ η g,indel = 0 are interpreted as having no speciﬁc selection pressure foror against indel mutations. In Figures 4 and 5 we highlight genes with large (in absolutevalue) parameter estimates, some of which have known biological relevance in oncology; seeSection 4 for further discussion.Figure 3: The average deviance (with one standard deviation) across the 10 folds in our cross-validation procedure plotted against log( κ ). The minimum average deviance is highlightedred.We now validate our model in (3) by comparing with the following alternatives:9igure 4: Manhattan plot of ﬁtted parameters ˆ λ g and their associated genes’ chromosomallocations. The genes with the ﬁve largest positive parameter estimates are labelled.Figure 5: Manhattan plot of ﬁtted parameters ˆ η g, indel and their associated genes’ chromoso-mal locations. The ﬁve largest positive and negative genes are labelled.10i) Saturated model : the model in (2), where each observation has an associated freeparameter (i.e. φ igs > No sample-speciﬁc eﬀects : the model in (3), with µ i = 0 for all i ∈ { , . . . , n } ;(iii) No gene-speciﬁc eﬀects : the model in (3), with λ g = η gs = 0 for all g ∈ G and s ∈ S ;(iv) No gene/mutation type interactions : the model in (3), with η gs = 0 for all g ∈ G and s ∈ S .In Table 1 we present the residual deviance and the residual degrees of freedom betweenour model and each of the models above. We see that our model is preferred over thesaturated model, and all three submodels of (3).Table 1: Model comparisons on the basis of residual deviance statistics.Comparison Residual Residual Degrees dev/df p -valueModel Deviance (dev) of Freedom (df)(i) 1 . × . × . × − . . × . × . × . . × . × . × . . × . × . × . We now demonstrate the excellent practical performance of our procedure for estimatingTMB. First it is shown that our method can indeed select gene panels of size speciﬁed bythe practitioner and that good predictions can be made even with small panel sizes (i.e. ≤ R scoreon the validation data as follows: given predictions of TMB, ˆ t , . . . , ˆ t n val , for the observationsin the validation set with true TMB values t , . . . , t n val . Let ¯ t := n val (cid:80) n val i =1 t i , and deﬁne R := 1 − (cid:80) n val i =1 ( t i − ˆ t i ) (cid:80) n val i =1 ( t i − ¯ t ) . Other existing works have aimed to classify tumours into two groups (high TMB, lowTMB); see, for example, B¨uttner et al. (2019) and Wu et al. (2019a). Here we also reportthe estimated area under the precision-recall curve (AUPRC) for a classiﬁer based on ourestimator. We deﬁne the classiﬁer as follows: ﬁrst, in line with major clinical studies (e.g.Hellmann et al., 2018; Ramalingam et al., 2018) the true class membership of a tumouris deﬁned according to whether it has t ∗ := 300 or more exome mutations (approximately10 Mut/Mb). In the validation set, this gives 47 (27 . . t ≥

0, we can deﬁne a classiﬁer by assigning atumour to the high TMB class if its estimated TMB value is greater than or equal to t . Forsuch a classiﬁer, we have precision and recall (estimated over the validation set) given by p ( t ) := (cid:80) n val i =1 { ˆ t i ≥ t, t i ≥ t ∗ } (cid:80) n val i =1 { ˆ t i ≥ t } and r ( t ) := (cid:80) n val i =1 { ˆ t i ≥ t, t i ≥ t ∗ } (cid:80) n val i =1 { t i ≥ t ∗ } , { ( r ( t ) , p ( t )) : t ∈ [0 , ∞ ) } . Note that a perfectclassiﬁer achieves a AUPRC of 1, whereas a random guess in this case would have an averageAUPRC of 0.308 (the prevalence of the high TMB class).Now recall that TMB is given by equation (1) with ¯ S being the set of all non-synonymousmutation types. Thus to estimate TMB we apply our procedure in Section 2.3 with ¯ S = S ,where the model parameters are estimated as described in Section 3.1. In Figure 6, wepresent the R and AUPRC for the ﬁrst-ﬁt and reﬁtted estimators (see (6) and (8)) as theselected panel size varies from 0Mb to 2Mb in length. We see that we obtain a more accurateprediction of TMB, both in terms of regression and classiﬁcation, as the panel size increases,and that good estimation is possible even with very small panels (as low as 0.2Mb). Finally,as expected, the reﬁtted estimator slightly outperforms the ﬁrst-ﬁt estimator.Figure 6: Performance of our ﬁrst-ﬁt and reﬁtted estimators of TMB as the selected panelsize varies. Left : R , Right : AUPRC.We now compare our method with state-of-the-art estimators applied to commonly usedgene panels. The three next-generation sequencing panels that we consider are chosen fortheir relevance to TMB. These are TST-170 (Heydt et al., 2018), Foundation One (Framptonet al., 2013) and MSK-IMPACT (Cheng et al., 2015). For each panel P ⊆ G , we use fourdiﬀerent methods to predict TMB:(i) Our reﬁtted estimator applied to the panel P : we estimate TMB using T ( ˆ w P ), whereˆ w P ∈ arg min w ∈ W P { f ( w ) } , and W P is deﬁned in (7).(ii) Estimation and Classiﬁcation of Tumour Mutation Burden (ecTMB): the procedureproposed by Yao et al. (2020).(iii) A count estimator: TMB is estimated by (cid:96) G (cid:96) P (cid:80) g ∈ P (cid:80) s ∈ ¯ S M gs , i.e. rescaling the muta-tion burden in the genes of P .(iv) A linear model: we estimate TMB via ordinary least-squares linear regression of TMBagainst (cid:8)(cid:80) s ∈ S M gs : g ∈ P (cid:9) .The latter three comprise existing methods for estimating TMB available to practitioners.The second (ecTMB), which is based on a negative binomial model, is the state-of-the-art.12he third and fourth are standard practical procedures for the estimation of TMB fromtargeted gene panels. The reﬁtted estimator applied to the panel P is also included here, inorder to demonstrate the utility of our approach even with a prespeciﬁed panel.Figure 7: The performance of our TMB estimator in comparison to existing approaches. Left : R , Right : AUPRC.We present results of these comparisons in Figure 7. First, for each of the three pan-els considered here, we see that our reﬁtted estimator applied to the panel outperforms allexisting approaches in terms of regression performance, and that for smaller panels we areable to improve regression accuracy even further by selecting a panel based on the trainingdata. For instance, in comparison to predictions based on the TST-170 panel, our procedurewith a selected panel of the same size (0 . R of 0 .

85. The best availableexisting method based on the TST-170 panel, in this case the linear estimator, has an R of0 .

74. Moreover, data-driven selection of panels considerably increases the classiﬁcation per-formance for the whole range of panel sizes considered. In particular, even for the smallestpanel size shown in Figure 7 ( ∼ R value (on the test data) of 0 .

85. The other methods have R values of 0 . −

36 (count) and 0 .

64 (linear regression). The count-based estimator here givespredictions which are reasonably well correlated to the true values of TMB but are positivelybiased. This is as expected, since our selection procedure tends to favour genes with higheroverall mutation rates. We also include a red shaded region comprising all points for which13euristic 90% prediction intervals (as described in Section 2.5) include the true TMB value.We ﬁnd in this case that 93 .

6% of the observations in the test set fall within this region,giving valid empirical coverage.Figure 8: Prediction of TMB on the test dataset. Dashed blue (diagonal) line representsperfect prediction, and the black dashed lines indicate true and predicted TMB values of300.

In this section we demonstrate how our method can be used to estimate TIB. This is morechallenging than estimating TMB due to the low abundance of indel mutations relative toother variant types (see Figure 2), as well as issues involved in sequencing genomic lociof repetitive nucleotide constitution (Narzisi and Schatz, 2015). Indeed, in contrast to theprevious section, we are not aware of any existing methods designed to estimate TIB fromtargeted gene panels. We therefore investigate the performance of our method across a muchwider range (0-30Mb) of panel sizes, and ﬁnd that we are able to accurately predict TIBwith larger panels. Our results also demonstrate that accurate classiﬁcation of TIB statusis possible even with small gene panels.We let S indel be the set of all frameshift insertion and deletion mutations, and apply ourmethod introduced in Section 2.3 with ¯ S = S indel . As in the previous section, we assessregression and classiﬁcation performance via R and AUPRC, respectively, where in thiscase tumours are separated into two classes: high TIB (10 or more indel mutations) and lowTIB (otherwise). In the validation dataset, this gives 57 (33 . Left : R , Right : AUPRC.The results are presented in Figure 9. We comment ﬁrst on the regression performance:as expected, we see that the R values for our ﬁrst-ﬁt and reﬁtted estimators are much lowerthan what we achieved in estimating TMB. The reﬁtted approach improves for larger panelsizes, while the ﬁrst-ﬁt estimator continues to perform relatively poorly. On the other hand,we see that the classiﬁcation performance is impressive, with AUPRC values of above 0.8for panels of less than 1Mb in size.We now assess the performance on the test set of our reﬁtted estimator of TIB appliedto a selected panel of size 0.6Mb, and we compare with a count-based estimator and linearregression estimator. We do not compare with ecTMB here, since it is designed to estimateTMB as opposed to TIB. The count-based estimator in this case scales the total numberof non-synonymous mutations across the panel by the ratio of the length of the panel tothat of the entire exome, and also by the relative frequency of indel mutations versus allnon-synonymous mutations in the training dataset: (cid:96) G (cid:96) P (cid:80) ni =1 (cid:80) g ∈ G (cid:80) s ∈ S indel M igs (cid:80) ni =1 (cid:80) g ∈ G (cid:80) s ∈ S M igs (cid:88) g ∈ P (cid:88) s ∈ S M gs . In Figure 10 we present the predictions on the test set of our reﬁtted estimator ( R = 0 . R = − R = 0 . .

7% of test set points fall within this region.

As discussed in Section 2.4, we may wish to include genes from a given panel, but use ourmethodology to augment the panel to include additional genes with goal of obtaining moreaccurate predictions of TMB (or other biomarkers). In this section we demonstrate how thiscan be done starting with the TST-170 panel ( ∼ P taken to be theset of TST-170 genes and Q to be empty. The genes added to the panel are determinedby the ﬁrst-ﬁt estimator in equation (9). To evaluate the performance, we then apply thereﬁtted estimator on the test dataset, after selecting the augmented panel of size 0.6Mb. Forcomparison, we apply our reﬁtted estimator to the TST-170 panel directly. We also presentthe results obtained by the other estimators described above, both before and after the panelaugmentation, in Table 2. We ﬁnd that by augmenting the panel we improve predictiveperformance with our reﬁtted ˆ T estimator, both in terms of regression and classiﬁcation.The reﬁtted estimator provides better estimates than any other model on the augmentedpanel by both metrics.Table 2: Predictive performance of models on TST-170 (0.4Mb) versus augmented TST-170(0.6Mb) panels on the test set.Model Regression ( R ) Classiﬁcation (AUPRC)TST-170 Aug. TST-170 TST-170 Aug. TST-170Reﬁtted ˆ T ecTMB 0.37 0.51 0.80 0.88Count 0.18 0.18 Linear 0.47 0.74 0.78 0.8916

Conclusions

We have introduced a new data-driven framework for designing targeted gene panels whichallows for cost-eﬀective estimation of exome-wide biomarkers. Using the Non-Small CellLung Cancer dataset from Campbell et al. (2016), we have demonstrated the excellent pre-dictive performance of our proposal for estimating Tumour Mutation Burden and TumourIndel Burden, and shown that it outperforms the state-of-the-art procedures. Our frameworkcan be applied to any tumour dataset containing annotated mutations, and we provide an R package (Bradley and Cannings, 2021) which implements the methodology.Our work also has the scope to help understand mutational processes. For example, theparameters of our ﬁtted model in Section 3.1 have interesting interpretations: of the ﬁvegenes highlighted in Figure 4 as having the highest mutation rates relative to the BMR,three ( TP53, KRAS, CDKN2A ) are known tumour suppressors (Olivier et al., 2010; Janˇc´ıket al., 2010; Foulkes et al., 1997). Furthermore, indel mutations in

KRAS are known to bedeleterious for tumour cells (Lee et al., 2018) – in our work the

KRAS gene has a largenegative indel-speciﬁc parameter (see Figure 5). Our methodology identiﬁes a number ofother genes with large parameter estimates.Finally, we believe there are many ways in which our general framework can be extended.For example, it may be adapted to incorporate alternate data types (e.g. transcriptomics);we may seek to predict other features (e.g. outcomes such as survival); or we may wish toextend the method to incorporate multiple data sources (e.g. on diﬀerent cancer types andtissues of origin).

We gratefully acknowledge funding provided by Cambridge Cancer Genomics (CCG) throughtheir PhD Scholarship at the University of Edinburgh. We also beneﬁted from discussionswith several individuals, including Adnan Akbar, Philip Beer, Harry Cliﬀord, AleksandraJartseva, Morton, Kevin Myant, William Orchard, Nirmesh Patel and Charlotte Paterson.

References

T. Boveri. Concerning the Origin of Malignant Tumours. Translated and annotated byHenry Harris.

Journal of Cell Science , 121(Supplement 1):1–84, Jan. 2008. ISSN 0021-9533, 1477-9137. doi: 10.1242/jcs.025742. Publisher: The Company of Biologists LtdSection: Article.J. R. Bradley and T. I. Cannings. ICBioMark: Data-Driven Design of Targeted Gene Pan-els for Estimating Immunotherapy Biomarkers (R Package), Jan. 2021. URL https://github.com/cobrbra/ICBioMark .E. I. Buchbinder and A. Desai. CTLA-4 and PD-1 Pathways: Similarities, Diﬀerences, andImplications of Their Inhibition.

American Journal of Clinical Oncology , 39(1):98–106,Feb. 2016. ISSN 1537-453X. doi: 10.1097/COC.0000000000000239.17. Budczies, M. Allg¨auer, and K. Litchﬁeld. Optimizing panel-based tumor mutationalburden (TMB) measurement.

Annals of Oncology: Oﬃcial Journal of the European Societyfor Medical Oncology , 30(9):1496–1506, 2019. ISSN 1569-8041. doi: 10.1093/annonc/mdz205.R. B¨uttner, J. W. Longshore, and F. L´opez-R´ıos. Implementing TMB measurement inclinical practice: considerations on assay requirements.

ESMO Open , 4(1):e000442, Jan.2019. ISSN 2059-7029. doi: 10.1136/esmoopen-2018-000442.J. D. Campbell, A. Alexandrov, and J. Kim. Distinct patterns of somatic genome alterationsin lung adenocarcinomas and squamous cell carcinomas.

Nature Genetics , 48(6):607–616,2016. ISSN 1546-1718. doi: 10.1038/ng.3564.D. Cao, H. Xu, and X. Xu. High tumor mutation burden predicts better eﬃcacy ofimmunotherapy: a pooled analysis of 103078 cancer patients.

Oncoimmunology , 8(9):e1629258, 2019. ISSN 2162-4011. doi: 10.1080/2162402X.2019.1629258.Z. R. Chalmers, C. F. Connelly, and D. Fabrizio. Analysis of 100,000 human cancer genomesreveals the landscape of tumor mutational burden.

Genome Medicine , 9, Apr. 2017. ISSN1756-994X. doi: 10.1186/s13073-017-0424-2.T. A. Chan, M. Yarchoan, and E. Jaﬀee. Development of tumor mutation burden as animmunotherapy biomarker: utility for the oncology clinic.

Annals of Oncology , 30(1):44–56, Jan. 2019. ISSN 0923-7534, 1569-8041. doi: 10.1093/annonc/mdy495.D. T. Cheng, T. N. Mitchell, and A. Zehir. Memorial Sloan Kettering-Integrated MutationProﬁling of Actionable Cancer Targets (MSK-IMPACT).

The Journal of Molecular Diag-nostics : JMD , 17(3):251–264, May 2015. ISSN 1525-1578. doi: 10.1016/j.jmoldx.2014.12.006.L. Fancello, S. Gandini, and P. G. Pelicci. Tumor mutational burden quantiﬁcation fromtargeted gene panels: major advancements and challenges.

Journal for ImmunoTherapyof Cancer , 7(1):183, July 2019. ISSN 2051-1426. doi: 10.1186/s40425-019-0647-4.Y. W. Fong, C. Cattoglio, and R. Tjian. The intertwined roles of transcription and repairproteins.

Molecular Cell , 52(3):291–302, Nov. 2013. ISSN 1097-4164. doi: 10.1016/j.molcel.2013.10.018.W. D. Foulkes, T. Y. Flanders, and P. M. Pollock. The CDKN2A (p16) gene and humancancer.

Molecular Medicine , 3(1):5–20, Jan. 1997. ISSN 1076-1551.G. M. Frampton, A. Fichtenholtz, and G. A. Otto. Development and validation of a clin-ical cancer genomic proﬁling test based on massively parallel DNA sequencing.

NatureBiotechnology , 31(11):1023–1031, Nov. 2013. ISSN 1546-1696. doi: 10.1038/nbt.2696.J. Friedman, T. Hastie, and R. Tibshirani. glmnet: Lasso and Elastic-Net RegularizedGeneralized Linear Models, June 2020. URL https://CRAN.R-project.org/package=glmnet . 18. H. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized LinearModels via Coordinate Descent.

Journal of Statistical Software , 33(1):1–22, Feb. 2010.ISSN 1548-7660. doi: 10.18637/jss.v033.i01.D. R. Gandara, S. M. Paul, and M. Kowanetz. Blood-based tumor mutational bur-den as a predictor of clinical beneﬁt in non-small-cell lung cancer patients treatedwith atezolizumab.

Nature Medicine , 24(9):1441–1448, 2018. ISSN 1546-170X. doi:10.1038/s41591-018-0134-3.S. Gettinger, H. Borghaei, and J. Brahmer. 5-Year Outcomes From the Randomized, Phase3 Trials CheckMate 017/057: Nivolumab vs Docetaxel in Previously Treated NSCLC.

Journal of Thoracic Oncology , 14(10):S244–S245, Oct. 2019. ISSN 1556-0864. doi: 10.1016/j.jtho.2019.08.486.M. Golkaram, C. Zhao, and K. Kruglyak. The interplay between cancer type, panel sizeand tumor mutational burden threshold in patient selection for cancer immunotherapy.

PLOS Computational Biology , 16(11):e1008332, Nov. 2020. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1008332.G. Gong and F. J. Samaniego. Pseudo Maximum Likelihood Estimation: Theory and Ap-plications.

The Annals of Statistics , 9(4):861–869, 1981. ISSN 0090-5364.W. Guo, Y. Fu, and L. Jin. An Exon Signature to Estimate the Tumor Mutational Burdenof Right-sided Colon Cancer Patients.

Journal of Cancer , 11(4):883–892, 2020. ISSN1837-9664. doi: 10.7150/jca.34363.M. D. Hellmann, T.-E. Ciuleanu, and A. Pluzanski. Nivolumab plus Ipilimumab in LungCancer with a High Tumor Mutational Burden.

New England Journal of Medicine , 378(22):2093–2104, May 2018. ISSN 0028-4793. doi: 10.1056/NEJMoa1801946.C. Heydt, R. Pappesch, and K. Stecker. Evaluation of the TruSight Tumor 170 (TST170)assay and its value in clinical research.

Annals of Oncology , 29:vi7–vi8, Sept. 2018. ISSN0923-7534, 1569-8041. doi: 10.1093/annonc/mdy318.003.Y. Ishida, Y. Agata, and K. Shibahara. Induced expression of PD-1, a novel member of theimmunoglobulin gene superfamily, upon programmed cell death.

The EMBO journal , 11(11):3887–3895, Nov. 1992. ISSN 0261-4189.S. Janˇc´ık, J. Dr´abek, and D. Radzioch. Clinical Relevance of KRAS in Human Cancers.

Journal of Biomedicine and Biotechnology , 2010, 2010. ISSN 1110-7243. doi: 10.1155/2010/150960.D. R. Leach, M. F. Krummel, and J. P. Allison. Enhancement of antitumor immunity byCTLA-4 blockade.

Science (New York, N.Y.) , 271(5256):1734–1736, Mar. 1996. ISSN0036-8075. doi: 10.1126/science.271.5256.1734.H. Ledford, H. Else, and M. Warren. Cancer immunologists scoop medicine Nobel prize.

Nature , 562(7725):20–21, Oct. 2018. doi: 10.1038/d41586-018-06751-0. Number: 7725Publisher: Nature Publishing Group. 19. Lee, J. H. Lee, and S. Jun. Selective targeting of KRAS oncogenic alleles byCRISPR/Cas9 inhibits proliferation of cancer cells.

Scientiﬁc Reports , 8(1):11879, Aug.2018. ISSN 2045-2322. doi: 10.1038/s41598-018-30205-2. Number: 1 Publisher: NaturePublishing Group.G.-Y. Lyu, Y.-H. Yeh, and Y.-C. Yeh. Mutation load estimation model as a predictor of theresponse to cancer immunotherapy. npj Genomic Medicine , 3(1):1–9, Apr. 2018. ISSN2056-7944. doi: 10.1038/s41525-018-0051-x. Number: 1 Publisher: Nature PublishingGroup.K. D. Makova and R. C. Hardison. The eﬀects of chromatin organization on variation inmutation rates in the genome.

Nature Reviews Genetics , 16(4):213–223, Apr. 2015. ISSN1471-0064. doi: 10.1038/nrg3890. Number: 4 Publisher: Nature Publishing Group.T. Michoel. Natural coordinate descent algorithm for L1-penalised regression in generalisedlinear models.

Computational Statistics & Data Analysis , 97:60–70, May 2016. ISSN0167-9473. doi: 10.1016/j.csda.2015.11.009.G. Narzisi and M. C. Schatz. The Challenge of Small-Scale Repeats for Indel Discovery.

Frontiers in Bioengineering and Biotechnology , 3, Jan. 2015. ISSN 2296-4185. doi: 10.3389/fbioe.2015.00008.T. S. Nowicki, S. Hu-Lieskovan, and A. Ribas. Mechanisms of Resistance to PD-1 andPD-L1 blockade.

Cancer journal (Sudbury, Mass.) , 24(1):47–53, 2018. ISSN 1528-9117. doi: 10.1097/PPO.0000000000000303. URL .M. Olivier, M. Hollstein, and P. Hainaut. TP53 Mutations in Human Cancers: Origins,Consequences, and Clinical Use.

Cold Spring Harbor Perspectives in Biology , 2(1), Jan.2010. ISSN 1943-0264. doi: 10.1101/cshperspect.a001008.D. M. Pardoll. The blockade of immune checkpoints in cancer immunotherapy.

Naturereviews. Cancer , 12(4):252–264, Mar. 2012. ISSN 1474-175X. doi: 10.1038/nrc3239.S. S. Ramalingam, M. D. Hellmann, and M. M. Awad. Tumor mutational burden (TMB)as a biomarker for clinical beneﬁt from dual immune checkpoint blockade with nivolumab(nivo) + ipilimumab (ipi) in ﬁrst-line (1L) non-small cell lung cancer (NSCLC).

CancerResearch , 78(13 Supplement):CT078–CT078, July 2018. ISSN 0008-5472, 1538-7445. doi:10.1158/1538-7445.AM2018-CT078.C. Robert. A decade of immune-checkpoint inhibitors in cancer therapy.

Nature Communi-cations , 11(1):3801, July 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-17670-y.V. Roth and B. Fischer. The Group-Lasso for generalized linear models: uniqueness ofsolutions and eﬃcient algorithms. In

Proceedings of the 25th international conference onMachine learning , ICML ’08, pages 848–855, New York, NY, USA, July 2008. Associationfor Computing Machinery. ISBN 9781605582054. doi: 10.1145/1390156.1390263.A. Sboner, X. J. Mu, and D. Greenbaum. The real cost of sequencing: higher than you think!

Genome Biology , 12(8):125, Aug. 2011. ISSN 1474-760X. doi: 10.1186/gb-2011-12-8-125.20. Turajlic, K. Litchﬁeld, and H. Xu. Insertion-and-deletion-derived tumour-speciﬁc neoanti-gens and the immunogenic phenotype: a pan-cancer analysis.

The Lancet. Oncology , 18(8):1009–1021, 2017. ISSN 1474-5488. doi: 10.1016/S1470-2045(17)30516-8.H.-X. Wu, Z.-X. Wang, and Q. Zhao. Designing gene panels for tumor mutational burdenestimation: the need to shift from ’correlation’ to ’accuracy’.

Journal for Immunotherapyof Cancer , 7(1):206, 2019a. ISSN 2051-1426. doi: 10.1186/s40425-019-0681-2.H.-X. Wu, Z.-X. Wang, and Q. Zhao. Tumor mutational and indel burden: a systematicpan-cancer evaluation as prognostic biomarkers.

Annals of Translational Medicine , 7(22):640, Nov. 2019b. ISSN 2305-5847. doi: 10.21037/31486. Number: 22.Y. Yang and H. Zou. A fast uniﬁed algorithm for solving group-lasso penalize learningproblems.

Statistics and Computing , 25(6):1129–1141, Nov. 2015. ISSN 1573-1375. doi:10.1007/s11222-014-9498-5.Y. Yang, H. Zou, and S. Bhatnagar. gglasso: Group Lasso Penalized Learning Using a UniﬁedBMD Algorithm, Mar. 2020. URL https://CRAN.R-project.org/package=gglasso .L. Yao, Y. Fu, and M. Mohiyuddin. ecTMB: a robust method to estimate and classifytumor mutational burden.

Scientiﬁc Reports , 10(1):1–10, Mar. 2020. ISSN 2045-2322.doi: 10.1038/s41598-020-61575-1. Number: 1 Publisher: Nature Publishing Group.A. D. Yates, P. Achuthan, and W. Akanni. Ensembl 2020.

Nucleic Acids Research , 48(D1):D682–D688, Jan. 2020. ISSN 0305-1048. doi: 10.1093/nar/gkz966.J. Zhu, T. Zhang, and J. Li. Association Between Tumor Mutation Burden (TMB) andOutcomes of Cancer Patients Treated With PD-1/PD-L1 Inhibitions: A Meta-Analysis.