[PDF] Maximum n-times Coverage for Vaccine Design

Abstract

We introduce the maximum n-times coverage problem that selects k overlays to maximize the summed coverage of weighted elements, where each element must be covered at least n times. We also define the min-cost n-times coverage problem where the objective is to select the minimum set of overlays such that the sum of the weights of elements that are covered at least n times is at least \tau. Maximum n-times coverage is a generalization of the multi-set multi-cover problem, is NP-complete, and is not submodular. We introduce two new practical solutions for n-times coverage based on integer linear programming and sequential greedy optimization. We show that maximum n-times coverage is a natural way to frame peptide vaccine design, and find that it produces a pan-strain COVID-19 vaccine design that is superior to 29 other published designs in predicted population coverage and the expected number of peptides displayed by each individual's HLA molecules.

Full PDF

MMaximum n -times Coverage for COVID-19 Vaccine Design Ge Liu, Brandon Carter, David Gifford MIT Computer Science and Artiﬁcial Intelligence Laboratory, Cambridge, MA, [email protected], [email protected], [email protected]

Abstract

In the maximum n -times coverage problem, we are provideda set of elements, a weight for each element, and a set of over-lays where each overlay speciﬁes an element speciﬁc cover-age of zero or more times. The goal is to select up to k over-lays such that the sum of the weights of elements that arecovered at least n times is maximized. We also deﬁne themin-cost n -times coverage problem where the objective is toselect the minimum set of overlays such that the sum of theweights of elements that are covered at least n times is atleast τ . We show that the n -times coverage objective is notsubmodular, and we present an efﬁcient solution by sequen-tial greedy optimization. We frame the design of a peptidevaccine for COVID-19 as maximum n -times coverage usingmachine learning deﬁned candidate peptide sets, and showthat our solution is superior to 29 other published COVID-19peptide vaccine designs in predicted population coverage andthe expected number of peptides displayed by each individ-ual’s HLA molecules. Introduction

The COVID-19 pandemic has caused an unprecedentedchallenge for healthcare systems worldwide, and developinga vaccine that produces durable immunity with high popu-lation coverage is of upmost importance. A peptide vaccineuses a set of peptides to elicit a protective cellular immuneresponse (Ott et al. 2017; Li et al. 2014). For a peptide tobe effective in a vaccine it must be presented by an individ-ual’s Major Histocompatibility Complex (MHC) moleculesand be immunogenic. A displayed peptide is immunogenicwhen it activates T cells, which expand in number and mounta response against pathogens or tumor cells. Memory T cellsprovide robust immunity against COVID-19, even in the ab-sence of detectable antibodies (Sekine et al. 2020). The useof peptide vaccine components to activate T cells is in de-velopment for cancer (Hu, Ott, and Wu 2018) and viral dis-eases including HIV (Arunachalam et al. 2020), HPV (Ken-ter et al. 2009) and malaria (Li et al. 2014; Rosendahl Huberet al. 2014).A challenge for the design of peptide vaccines is the di-versity of human MHC alleles that each have speciﬁc pref-erences for the peptide sequences they will display. TheHuman Leukocyte Antigen (HLA) loci, located within theMHC, encode the HLA class I and class II molecules. We consider the three classical class I loci (HLA-A, HLA-B,and HLA-C) and three loci that encode class II molecules(HLA-DR, HLA-DQ, and HLA-DP). An individual’s HLAtype describes the alleles they carry at each of these loci.Peptides of length 8–10 residues can bind to HLA class Imolecules whereas those of length 13–25 bind to HLA classII molecules (Rist et al. 2013; Chicz et al. 1992). To createeffective vaccines it is necessary to consider the HLA allelicfrequency in the population, as well as linkage disequilib-rium between HLA genes to discover a set of peptides thatis likely to be robustly displayed.For a peptide vaccine to be effective it needs to invoke arobust cellular immune response with a diverse set of pep-tides (Schultheiß et al. 2020; Grifoni et al. 2020b), whichmotivates our use of n -times coverage. Redundant coverageis also important to compensate for our imperfect ability topredict what peptides will be immunogenic in a given in-dividual. We adopt an experimentally calibrated model ofpeptide-HLA immunogenicity to create a set of candidatesand perform population coverage optimization to select acompact set of peptides that provides diverse display of pep-tides in each individual. Our vaccine designs and evaluationsare based upon the observed immunogenicity of peptides inconvalescent COVID-19 patient peripheral blood mononu-clear cell samples (Snyder et al. 2020; Klinger et al. 2015;Nolan et al. 2020), and by ensembling machine learning pre-dictions for peptides or HLA alleles that are not observed inthe clinical data (Liu, Carter, and Gifford 2021).We formalize the number of times that an individual dis-plays peptides as a coverage problem, and thus we gener-alize single coverage problems to n -times coverage prob-lems. We show that framing vaccine design as maximum n -times coverage produces a solution that produces superiorpredicted population coverage when compared to 29 previ-ous published vaccines for COVID-19 with less than 150peptides. We also show that maximum n -times coverage isnot submodular, and introduce M ARGINAL G REEDY , an ef-ﬁcient algorithm for solving n -times coverage problem onboth synthetic data and real vaccine design.In the maximum n -times coverage problem, we are pro-vided a set of elements, a weight for each element, and a setof overlays where each overlay speciﬁes an element speciﬁccoverage of zero or more times. The goal is to select up to k overlays such that the sum of the weights of elements that a r X i v : . [ q - b i o . Q M ] J a n re covered at least n times is maximized. In this framing,an element is a speciﬁc collection of HLA alleles (a haplo-type), weights are the frequency of haplotypes in the popula-tion, n is the desired number of peptides displayed by eachindividual, and an overlay is a peptide that is predicted tobe displayed by each haplotype a speciﬁed number of times.The solution of the maximum n -time coverage problem al-lows us to ﬁnd a set of overlays (peptides) that maximizesthe sum of weights (population coverage).Our introduction of weighted elements and required n -times coverage creates a new class of problem without aknown solution or complexity bound. The closest problemto the maximum n -times coverage problem is the multi-setmulti-cover problem which does not assign weights to theelements and thus can be formulated as the Covering In-teger Program (CIP) problem (Srinivasan 1999). A log( n ) -time approximation algorithm for CIP can violate coverageconstraints (Dobson 1982; Kolliopoulos 2003; Kolliopoulosand Young 2001). Other existing robust coverage problemsdo not address the need for explicit n -times coverage of ele-ments to ensure each individual is predicted to display multi-ple peptides. Deletion-robust submodular maximization pro-tects against adversarial deletion (Bogunovic, Zhao, andCevher 2018; Mirzasoleiman, Karbasi, and Krause 2017)and robust submodular optimization protects against objec-tive function uncertainty (Iyer 2019), but neither guaranteesthat each element is covered n times. The Maximum n -times coverage problem S ET C OVER and M

AXIMUM C OVERAGE

In the standard S ET C OVER and M

AXIMUM C OVERAGE problems, we are given a set U of |U| elements (also knownas the universe) and a collection S = { S , S , . . . , S m } of m subsets of U such that (cid:83) i S i = U . The goal in the S ET C OVER problem is to select a minimal-cardinality set of sub-sets from S such that their union covers U . In the M AXI - MUM C OVERAGE problem, our goal is to select k subsetsfrom S such that the union of these subsets contains maxi-mum number of elements. Both problems are known to beNP-hard (Feige 1998), but can be solved with a greedy ap-proximation algorithm with the best guaranteed approxima-tion ratio.A generalization of S ET C OVER is to use any submod-ular set function as the objective instead of the cardinal-ity function. Given a set function f : 2 V → R that as-sign each subset S ⊆ V a value f ( S ) where V is a ﬁniteset, we deﬁne the discrete derivative (or marginal gain ) as ∆ f ( e | S ) := f ( S ∪ { e } ) − f ( S ) for e ∈ V . A set function f is submodular iff it satisﬁes the following diminishing returnproperty: Deﬁnition 1. (Submodularity) A function f : 2 V → R issubmodular if for every A ⊆ B ⊆ V and e ∈ V \ B it holdsthat ∆ f ( e | A ) ≥ ∆ f ( e | B ) . Equivalently, f is a submodular function if for every A, B ⊆ v, f ( A ∩ B ) + f ( A ∪ B ) ≤ f ( A ) + f ( B ) .We can then deﬁne the S UBMODULAR S ET C OVER prob-lem as the problem of selecting a minimal-cardinality subset S ⊆ V such that f ( S ) = f ( V ) , where f is a monotone sub-modular function and V is the ground set of items. A similarproblem is the M IN - COST C OVERAGE problem: S ∗ = arg min S | S | s.t. f ( S ) ≥ τ The S

UBMODULAR M AXIMUM C OVERAGE problem is de-ﬁned as selecting a subset S ⊆ V that maximizes f ( S ) withcardinality constraint | S | ≤ k , where f is monotone andsubmodular.Given f is monotone submodular, a greedy algorithm isknown to provide (1 − e ) approximation for S UBMODU - LAR M AXIMUM C OVERAGE and ln(max i ∈ V f ( i )) + 1 ap-proximation for S UBMODULAR S ET C OVER (Nemhauser,Wolsey, and Fisher 1978). M ULTI - SET M ULTI - COVER problem

The M

ULTI - SET M ULTI - COVER (MSMC) problem is a gen-eralization of S ET C OVER , where multi-sets are special typeof sets in which an element e ∈ V can appear in S for morethan once. The objective of the MSMC problem is to deter-mine the minimum number of multisets (a multi-set can bechosen multiple times) such that each element i is coveredat least b i times. It can be formulated into Covering IntegerProgram (CIP) (Srinivasan 1999) problem: Deﬁnition 2. (Covering Integer Program, CIP) Given A ∈ R n × m + , b ∈ R n + , w ∈ R m + , d ∈ R m + , a CIP P = ( A, b, w, d ) seeks to minimize w T x , subject to Ax ≥ b, x ∈ Z m + ,and x ≤ d . Here A ij represents the number of times i -th elementappears in the j -th multi-set. The w is set to be all 1 inMSMC. The constraints x ≤ b are called multiplicity con-straints which limit the number of times a single multi-set can be reused, and they generally make covering prob-lems much harder as natural linear programming (LP) re-laxation can have an unbounded integrality gap (Chuzhoyand Naor 2006). Dobson (1982) provides a combinatorialgreedy H ( max j (cid:80) i A ij ) -approximation algorithm ( H ( t ) stands for t -th harmonic number) but multiplicity constraintscan be dealt with effectively only in the (0,1) case thus thisalgorithm can be as bad as polynomial. Kolliopoulos andYoung (2001); Kolliopoulos (2003) gave a tighter-bound so-lution that can obtain O ( logn ) -approximation. M AXIMUM n - TIMES C OVERAGE

We introduce the maximum n -times coverage problem, avariant of the MSMC problem that accounts for multi-ple coverage of each element while also assigning weightsto different elements. We are given a set X with l el-ements { X , X , . . . , X l } each associated with a non-negative weight w ( X i ) , and a set of m overlays A = { A , A , . . . , A m } . Each overlay A j covers each element i in X an element speciﬁc number of times c j ( X i ) , which issimilar to a multi-set. When c j ( X i ) = 0 the element X i isnot covered by overlay A j . We use a very strict multiplic-ity constraint such that each overlay can be used only once.Given a subset of overlays O ⊆ A , the total number of timesigure 1: Example of n -times coverage calculation.an element X i is covered by O is the sum of c j ( X i ) for eachoverlay j in O : C ( X i | O ) = (cid:88) j ∈ O c j ( X i ) (1)We deﬁne the n - times coverage function f n ( O ) as thesum of weights of elements in X that are covered at least n times by O . Figure 1 shows an example computation ofthe n - times coverage function . f n ( O ) = l (cid:88) i =1 w ( X i ) { C ( X i | O ) ≥ n } = l (cid:88) i =1 w ( X i ) { (cid:80) j ∈ O c j ( X i ) ≥ n } (2)The objective of the M AXIMUM n - TIMES C OVERAGE prob-lem is to select k overlays O ⊆ A such that f n ( O ) is max-imized. This can be formulated as the maximization of themonotone set function f n ( O ) under cardinality constraint k , O ∗ = arg max O ⊆A , | O |≤ k f n ( O ) (3)We deﬁne M IN - COST n - TIMES C OVERAGE as the min-imum set of overlays such that the sum of the weights ofelements covered at least n times is ≥ τ . We assume A pro-vides sufﬁcient n -times coverage for f n ( A ) ≥ τ . We de-ﬁne the n - TIMES S ET C OVER problem as the special caseof M IN - COST n - TIMES C OVERAGE where τ = (cid:80) i w ( X i ) .M IN - COST n - TIMES C OVERAGE O ∗ = arg min O | O | s.t. f n ( O ) ≥ τ (4) n - TIMES S ET C OVER O ∗ = arg min O | O | s.t. f n ( O ) = f n ( A ) (5) Theorem 1.

The n -times coverage function f n ( O ) is notsubmodular.Proof. We show f n ( O ) is not submodular for n = 2 (a sim-ilar counter example can be found for any n > ). Con-sider the example overlays A in Table 1. When n = 2 , noneof { A } , { A } , { A } or { A , A } provides n -times coverage Table 1: Coverage map of overlays used in the counter ex-ample w ( X i ) X X X X A A A { A , A } covers X two times and { A , A } covers X two times. Therefore, the marginal gain of adding A , A , or A into an empty set is always zero, whereas adding A into { A } or { A } achieves non-zero gains and adding A into { A , A } achieves even higher gain: ∆ f ( A |∅ ) := f n =2 ( { A } ) − f n =2 ( ∅ )= 0∆ f ( A |{ A } ) := f n =2 ( { A , A } ) − f n =2 ( { A } )= w ( X ) − . f ( A |{ A } ) := f n =2 ( { A , A } ) − f n =2 ( { A } )= w ( X ) − . f ( A |{ A , A } ) := f n =2 ( { A , A , A } ) − f n =2 ( { A , A } )= w ( X ) + w ( X ) − . Given that { A } ⊆ { A , A } and ∆ f ( A |{ A } ) < ∆ f ( A |{ A , A } ) , the function f n =2 ( O ) does not satisfythe diminishing return property (Deﬁnition 1) and thus isnot submodular.Since the n -times coverage problem is not submodular,we cannot take advantage of the proven near-optimal per-formance of the greedy algorithm for the S UBMODULAR M AXIMUM C OVERAGE problem. Thus we seek new solu-tions.

Vaccine population coverage maximization

Peptide vaccine design can be framed as an instance ofthe maximum n -time coverage problem. Peptides must im-munogenic to be effective in a vaccine, and we use an inte-grated model of peptide-HLA immunogenicity (Supplemen-tal Data). We deﬁne a peptide-HLA hit as a (peptide, HLAallele) pair where the peptide is predicted to immunogenicnd displayed the HLA allele. Once we have determined acandidate set of peptides that are predicted to be immuno-genic we then need to select a minimal subset of peptidessuch that each individual in τ target of the population is pre-dicted to have n peptide-HLA hits.We ﬁrst introduce EvalVax-Robust , an evaluation tool forestimating the population coverage of a proposed peptidevaccine set.

EvalVax-Robust evaluates the percentage of thepopulation having at least n peptide-HLA binding hits ineach individual. It takes into consideration allelic zygosity,and utilizes HLA haplotype frequencies for MHC class I(HLA-A/B/C) and MHC class II (HLA-DP/DQ/DR) genesto account for linkage disequilibrium (LD) among loci.The MHC class I genes are encoded by the HLA-A, HLA-B, and HLA-C loci and a single chromosome contains oneallele at each locus that is described by A i , B j , and C k respectively. The haplotype frequency of A i B j C k is de-noted by G ( i, j, k ) and (cid:80) M A i =1 (cid:80) M B j =1 (cid:80) M C k =1 G ( i, j, k ) = 1 .Each individual inherits two haplotypes to form their diploidgenotype. The the frequency of a diploid genotype is thus: F i j k i j k = F ( A i B j C k , A i B j C k )= G ( i , j , k ) G ( i , j , k ) (6)For a given peptide p and an HLA allele X ∈ A ∪ B ∪ C ,our machine learning model outputs a binary hit prediction e p ( X ) ∈ { , } indicating peptide-HLA immunogenicity.For each diploid genotype we compute the total number ofpeptide-HLA hits as the sum of e p ( X ) over the unique HLAalleles in the genotype. There can be 3-6 unique alleles de-pending on the zygosity of each locus: c i j k i j k p = c p ( A i B j C k , A i B j C k )= (cid:88) X ∈{ A i ,B j ,C k }∪{ A i ,B j ,C k } e p ( X ) (7)The total number of peptide-HLA hits provided by a set ofpeptides P is the sum of number of hits per peptide: C i j k i j k P = C ( A i B j C k , A i B j C k | P )= (cid:88) p ∈ P c p ( A i B j C k , A i B j C k ) (8)We deﬁne the EvalVax-Robust predicted population cover-age with ≥ n peptide-HLA hits for a peptide vaccine set P as: f n ( P ) = M A (cid:88) i =1 M B (cid:88) j =1 M C (cid:88) k =1 M A (cid:88) i =1 M B (cid:88) j =1 M C (cid:88) k =1 F i j k i j k · { C i j k i j k P ≥ n } (9)EvalVax-Robust optimization can be accomplished by a so-lution to maximum n -times coverage. X corresponds to thepossible genotypes, and the weights w ( X i ) are populationgenotype frequencies F ( A i B j C k , A i B j C k ) . The pep-tide candidate set is the set of possible overlays V , whereeach peptide p ∈ V is an overlay which covers a genotype c i j k i j k p times. The M

ARGINAL G REEDY algorithm formaximum n -times coverage We observed that greedy solutions to the maximum n -timescoverage problem are problematic when the n -times objec-tive is directly approached. The greedy approach can fail if itis unable to ﬁnd early overlays with sufﬁcient n -times cover-age, and can fail later because of early bad overlay choices.Both problems result in subsequent overlay choices not pro-viding sufﬁcient marginal gain to avoid ties and the randomselection of subsequent overlays. Thus we needed a solutionthat did not try to directly solve for n -times coverage at theoutset, but rather marginally approached it. Algorithm 1 M ARGINAL G REEDY algorithm (for M IN - COST n - TIMES C OVERAGE ) Input:

Weights of the elements in X : w ( X ) , w ( X ) , . . . , w ( X l ) , ground set of over-lays A where each overlay j in A covers X i for c j ( X i ) times, target minimum n target ,coverage cutoff for different n : τ , τ , . . . , τ n target , beamsize b Initialize beam B ← {∅} , t = 0 for n = 1 , . . . , n target do Using set function f n ( O ) = (cid:80) li =1 w ( X i ) { (cid:80) j ∈ O c j ( X i ) ≥ n } as objective function. repeat solution candidate set K t ← ∅ for O ∈ B t do (cid:46) Beam search for a ∈ A \ O do K t ← K t ∪ { O ∪ { a }} B t +1 ← { Top b candidate overlay subsets in K t scored by f n ( O ) } t ← t + 1 m t ← median score of candidate overlay sets inthe beam B t until m t ≥ τ n (in n target -th cycle the terminationcondition is max O ∈ B t f n ( O ) ≥ τ target )(for M AXIMUM n - TIMES C OVERAGE with cardinal-ity constraint k , there is additionaltermination condition t > k ) O ∗ = arg max O ∈ B t f n target ( O ) Output:

The ﬁnal selected subset of overlays O ∗ We propose M

ARGINAL G REEDY , an efﬁcient algorithmthat optimizes n -times coverage with a sequence of greedyoptimization cycles where the n -th cycle optimizes the cov-erage function f n ( O ) . We establish coverage starting at n =1 and increase the criteria to n target . Thus early overlay se-lections are guided by less stringent cycle-speciﬁc coverageobjectives. A set of coverage cutoffs { τ , τ , . . . , τ target } isused as the termination condition for each greedy optimiza-tion cycle, and when not speciﬁed, we assume τ = τ = · · · = τ target by default. We use beam search to keep trackof top b candidate solutions at each iteration.The full algorithm is given in Algorithm 1. A similar algo-rithm can be used to solve the n - TIMES S ET C OVER prob-lem in which τ target = f n target ( A ) . For the M AXIMUM n -igure 2: The M ARGINAL G REEDY algorithm outperforms the greedy algorithm on both LARGE and SMALL toy examples.Superior performance is seen on both the n - TIMES S ET C OVER where a smaller number of overlays is used by M

ARGINAL -G REEDY to achieve 100% coverage at different criteria of n , and on the M AXIMUM n - TIMES C OVERAGE problems whereM

ARGINAL G REEDY achieves higher coverage given same number of overlays. Shaded regions indicate 95% conﬁdence inter-vals.

TIMES C OVERAGE problem with cardinality constraint k ,the optimization terminates when t > k . When beam searchis not used to reduce computation time ( b = 1 ), we haveextended the algorithm to break ties during the n -times cov-erage iteration by looking ahead to ( n + 1) -times coverage.We call this extension look ahead tie-breaking .Another advantage of M ARGINAL G REEDY is that it is ca-pable of guaranteeing high coverage for n t < n target bycontrolling the cycle-speciﬁc coverage cutoffs τ t . This is de-sired in vaccine design where wider population coverage isalso important and we want to make sure that almost 100%of the population will be covered at least once. Results M ARGINAL G REEDY outperforms a greedybaseline on toy examples

We empirically evaluate our M

ARGINAL G REEDY algorithmwith two toy examples: a LARGE dataset where a setof 30 overlays are randomly generated to cover a set of10 elements (with equal weights) 0, 1, or 2 times, and aSMALL dataset where a set of 10 overlays that were ran-domly generated to cover a set of 5 elements with equalweights 0, 1, or 2 times. Figure 2 shows the efﬁciency ofthe M

ARGINAL G REEDY algorithm on both the n - TIMES S ET C OVER and M

AXIMUM n - TIMES C OVERAGE prob-lems for varying values of n . We compare our M ARGINAL -G REEDY algorithm to a greedy algorithm that directly opti-mizes n -times coverage and ﬁnd the greedy algorithm de-generates to sub-optimal solutions for large values of n .These sub-optimal solutions can be as bad as random. Werepeated problem generation and optimization with 6 dif- ferent random seeds to provide conﬁdence bounds and usedbeam size b = 1 with look-ahead tie-breaking. As shownin Figure 2, M ARGINAL G REEDY outperforms greedy on alltasks and datasets. We observed that M

ARGINAL G REEDY has a signiﬁcant advantage over greedy in tests with larger n and in regions of higher coverage. We also compared theperformance of polynomial-time M ARGINAL G REEDY to anexponential-time exhaustive search procedure that discoversthe optimal set of overlays on smaller problems. We foundM

ARGINAL G REEDY produces solutions close to optimal.

COVID-19 vaccine design using maximum n -timescoverage We used our M

ARGINAL G REEDY algorithm to designa peptide vaccine for COVID-19 using the EvalVax-Robust objective described in (9). We obtained the SARS-CoV-2 viral proteome from the ﬁrst documented casefrom GISAID (Elbe and Buckland-Merrett 2017) (se-quence entry Wuhan/IPBCAMS-WH-01/2019). We usedNextstrain (Hadﬁeld et al. 2018) to identify open readingframes (ORFs) and translate the sequence. We applied slid-ing windows of length 8–10 (MHC class I) and 13–25 (MHCclass II) to identify candidate peptides for inclusion in ourpeptide vaccine, resulting in 29,403 candidate peptides forMHC class I and 125,593 candidate peptides for MHC classII. We exclude peptides that may be glycosylated (Gupta,Jung, and Brunak 2004; Wolfert and Boons 2013), crossknown cleavage sites (Wang et al. 2020; Consortium 2019),overlap with human self-peptides (Consortium 2019), orhave a mutation rate > . (as calculated over 4,690 ge-ographically sampled viral genomes) (Elbe and Buckland-igure 3: EvalVax population coverage evaluation of SARS-CoV-2 vaccines for (A) MHC class I and (B) MHC class II. (a)EvalVax-Robust population coverage with n ≥ peptide-HLA hits per individual, OptiVax-Robust performance is shownby the blue curve and baseline performance is shown by red crosses (labeled by name of ﬁrst author). (b) EvalVax-Robustpopulation coverage with n ≥ peptide-HLA hits. (c) EvalVax-Robust population coverage with n ≥ peptide-HLA hits. (d)Comparison of OptiVax-Robust and baselines on expected number of peptide-HLA hits.Merrett 2017; Hadﬁeld et al. 2018) (details in Supplement).We use observed haplotype frequencies of 230 differentHLA-A, HLA-B, and HLA-C loci for class I computationsand observed haplotype frequencies of 280 different HLA-DP, HLA-DQ, and HLA-DR for class II computations (Liuet al. 2020). This dataset contains independent haplotypefrequency measurements for three populations self-reportingas having White, Black, or Asian ancestry. We score peptide-HLA immunogenicity based upon clinical data from conva-lescent COVID-19 patients (Liu, Carter, and Gifford 2021).For peptide-HLA combinations not observed in our clinicaldata we selected a machine learning model of immunogenic-ity that best predicted the peptide-HLA combinations we didobserve. Our selected MHC class I model predicts a peptide-HLA combination will be immunogenic if they are predictedto bind with an afﬁnity of ≤ ≤

50 nMby NetMHCIIpan-4.0.We implemented the OptiVax-Robust method for peptidevaccine design using the M

ARGINAL G REEDY algorithm,but modiﬁed M

ARGINAL G REEDY to reduce vaccine redun-dancy by eliminating unselected peptides that are withinthree (MHC class I) or ﬁve (MHC class II) edits on a se-quence distance metric from the selected peptides at eachiteration. We used a beam size of 10 for MHC class I and 5for MHC class II.Contemporary peptide vaccine design methods use ma-chine learning scoring of peptide display for speciﬁc HLAalleles followed by the manual selection of peptide sets or agreedy search that maximizes coverage at n =1. These meth- ods do not utilize the distribution of HLA haplotypes in apopulation and thus can not accurately assess the populationcoverage provided by a vaccine. For a selected set of pep-tides, the IEDB Population Coverage Tool (Bui et al. 2006)estimates peptide-MHC binding coverage and the distribu-tion of peptides displayed for a given population but assumesindependence between different loci and thus does not con-sider linkage disequilibrium. However, the IEDB tool doesnot provide automated peptide selection as we describe.We compared our vaccine designs with 29 vaccine de-signs in the literature on the probability that a vaccine hasat least N peptide-HLA hits per individual in a population,and the expected number of per individual peptide-HLA hitsin the population, which provides insight on how well a vac-cine is displayed on average (Lee and Koohy 2020; Fast,Altman, and Chen 2020; Poran et al. 2020; Bhattacharyaet al. 2020; Baruah and Bose 2020; Abdelmageed et al.2020; Ahmed, Quadeer, and McKay 2020; Srivastava et al.2020; Herst et al. 2020; Vashi, Jagrit, and Kumar 2020;Akhand, Azim, and et al. 2020; Mitra et al. 2020; Khanet al. 2020; Banerjee, Santra, and Maiti 2020; Ramaiah andArumugaswami 2020; Gupta, Mishra, and Niraj 2020; Sahaand Prasad 2020; Tahir ul Qamar et al. 2020; Singh et al.2020). Figure 3 shows the comparison between OptiVax-Robust (A) MHC class I and (B) class II vaccine designsat all vaccine sizes (top solution in the beam up to the givenvaccine size) from 1-45 peptides (blue curves) and baselinevaccines (red crosses) from prior work. We observe supe-rior performance of OptiVax-Robust vaccine designs on allevaluation metrics at all vaccine sizes for both MHC class Iand class II. Most baselines achieve reasonable coverage at n ≥ peptide hits. However, many fail to show a high prob-ability of higher hit counts, indicating a lack of predictedredundancy if a single peptide is not displayed. Note thatigure 4: SARS-CoV-2 OptiVax-Robust selected peptide vaccine set for (A) MHC class I and (B) MHC class II. (a) EvalVax-Robust population coverage at different per-individual number of peptide-HLA hit cutoffs for populations self-reporting ashaving White, Black, or Asian ancestry and average values. (b) Binding of vaccine peptides to each of the available alleles.(c) Distribution of the peptide origin. (d) Distribution of the number of per-individual peptide-HLA hits in populations self-reporting as having White, Black, or Asian ancestry. (e) Fraction of vaccine peptides that are also peptides of SARS-CoV.OptiVax-Robust also outperforms all baselines on n = 1 coverage and achieve 99.99% (MHC class I) and 99.67%(MHC class II) coverage for n = 1 , suggesting its capabil-ity to cover almost full population at least once while opti-mizing for higher peptide display diversity. We also foundthat OptiVax-Robust outperforms all baselines with an im-munogenicity model that does not incorporate clinical data(Supplement Figure S1). MHC class I results.

We selected an optimized set of pep-tides from all SARS-CoV-2 proteins using OptiVax. We useour MHC class I integrated model of peptide-HLA immuno-genicity for our objective function (Supplemental). After allof our ﬁltering steps we considered 1546 candidate peptides.With OptiVax-Robust optimization, we design a vaccinewith 18 peptides that achieves 99.999% EvalVax-Robustcoverage over populations self-reporting as having Asian,Black, or White ancestry with at least one peptide-HLA hitper individual. This set of peptides also provides 99.53%coverage with at least 5 peptide-HLA hits and 92.32% cov-erage with at least 8 peptide-HLA hits (Figure 4A, Table S1).The population level distribution of the number of peptide-HLA hits in White, Black, and Asian populations is shownin Figure 4A, where the expected number of peptide-HLAhits 12.532, 11.425 and 13.326, respectively.

MHC class II results.

We use our MHC class II modelof peptide-HLA immunogenicity for our objective function(Supplemental). After all of our ﬁltering steps we had 8003candidate peptides. With OptiVax-Robust optimization, wedesign a vaccine with 18 peptides that achieves 99.67%EvalVax-Robust coverage with at least one peptide-HLA hitper individual. This set of peptides also provides 97.22%coverage with at least 5 peptide-HLA hits and 87.99% cov-erage with at least 8 peptide-HLA hits (Figure 4B, Table S1).The population level distribution of the number of peptide-HLA hits per individual in populations self-reporting as hav-ing Asian, Black, or White ancestry is shown in Figure 4B,where the expected number of peptide-HLA hits is 17.206,16.085 and 11.210, respectively.

Conclusion

We introduced the maximum n -times coverage problem,and showed that it is not submodular. We presented a beamsearch algorithm for solving the problem, and used it to pro-duce a peptide vaccine design for COVID-19. We comparedthe resulting optimized peptide vaccine design with 29 otherpublished designs and found that the optimized design pro-vides substantially better population coverage for both MHCclass I and class II presentation of viral peptides. The use of n -times coverage as an objective increases vaccine redun-dancy and thus the probability that one of the presented pep-ides will be immunogenic and produce a T cell response.Our methods can also be used to augment existing vaccinedesigns with peptides to improve their predicted populationcoverage by initializing the algorithm with peptides that arepresent in an existing design. Ethics Statement

This work presents a principled combinatorial design ap-proach for developing peptide vaccines using machine learn-ing methods, and we apply it to develop a novel MHC class Iand II vaccine design for COVID-19. On our metrics, the re-sulting design is predicted to provide better population cov-erage than 29 existing vaccine designs with less than 150peptides. More broadly, the EvalVax and OptiVax methodswe introduce may be used to evaluate and design peptidevaccines against viral infections or cancer to achieve greaterpopulation coverage and robust peptide display per individ-ual over existing methods. The general formulation of themaximum n -times coverage problem may ﬁnd use in otherﬁelds, such as economics. One potential consequence of thiswork is identifying vaccines that do not provide adequatepopulation coverage that could lead to costly failures of vac-cine trials. We aim to mitigate this risk via our principledvaccine formulation objective and conservative approach fordeﬁning the candidate peptide set. Acknowledgements

This work was supported in part by Schmidt Futures, aGoogle Cloud Platform grant, NIH grant R01CA218094,and a C3.AI DTI Award to D.K.G. Ge Liu’s contributionwas made prior to joining Amazon.

Code Availability

An open source implementation of the methods describedcan be found at https://github.com/gifford-lab/optivax.

Author Contributions

G.L., B.C., and D.G. contributed to problem deﬁnition andsolution. G.L. designed and implemented the optimizationprocedure, with advice from B.C. and D.G. G.L., B.C., andD.G. wrote the paper.

Declaration of Interests

Brandon Carter is an employee of Think Therapeutics, andGe Liu is an employee of Amazon. David Gifford and Bran-don Carter have a ﬁnancial interest in Think Therapeutics,Inc.

References

Abdelmageed, M. I.; Abdelmoneim, A. H.; Mustafa, M. I.; Elfadol,N. M.; Murshed, N. S.; Shantier, S. W.; and Makhawi, A. M. 2020.Design of multi epitope-based peptide vaccine against E protein ofhuman 2019-nCoV: An immunoinformatics approach. bioRxiv .Ahmed, S. F.; Quadeer, A. A.; and McKay, M. R. 2020. Prelim-inary identiﬁcation of potential vaccine targets for the COVID-19coronavirus (SARS-CoV-2) based on SARS-CoV immunologicalstudies.

Viruses bioRxiv .Arunachalam, P. S.; Charles, T. P.; Joag, V.; and et al. 2020. T cell-inducing vaccine durably prevents mucosal SHIV infection evenwith lower neutralizing antibody titers.

Nature Medicine .Banerjee, A.; Santra, D.; and Maiti, S. 2020. Energetics based epi-tope screening in SARS CoV-2 (COVID 19) spike glycoprotein byImmuno-informatic analysis aiming to a suitable vaccine develop-ment. bioRxiv .Baruah, V.; and Bose, S. 2020. Immunoinformatics-aided identiﬁ-cation of T cell and B cell epitopes in the surface glycoprotein of2019-nCoV.

Journal of Medical Virology .Bhattacharya, M.; Sharma, A. R.; Patra, P.; and et al. 2020. Devel-opment of epitope-based peptide vaccine against novel coronavirus2019 (SARS-COV-2): Immunoinformatics approach.

Journal ofmedical virology

AISTATS .Bui, H.-H.; Sidney, J.; Dinh, K.; and et al. 2006. Predicting pop-ulation coverage of T-cell epitope-based diagnostics and vaccines.

BMC Bioinformatics

Nature

SIAM Journal on Computing

Nucleic acids research

Antiviral research

Mathematics of Opera-tions Research

GlobalChallenges bioRxiv .Feige, U. 1998. A threshold of ln n for approximating set cover.

Journal of the ACM (JACM)

Cell Host & Microbe

Cell doi:10.1016/j.cell.2020.05.015.Gupta, E.; Mishra, R. K.; and Niraj, R. R. K. 2020. Identiﬁcation ofpotential vaccine candidates against SARS-CoV-2, A step forwardto ﬁght novel coronavirus 2019-nCoV: A Reverse Vaccinology Ap-proach. bioRxiv .upta, R.; Jung, E.; and Brunak, S. 2004. Prediction of N-glycosylation sites in human proteins.

In preparation

Bioinformat-ics

Vaccine .Hu, Z.; Ott, P. A.; and Wu, C. J. 2018. Towards personalized,tumour-speciﬁc, therapeutic vaccines for cancer.

Nature ReviewsImmunology arXiv preprint arXiv:1906.06393 .Jurtz, V.; Paul, S.; Andreatta, M.; Marcatili, P.; Peters, B.; andNielsen, M. 2017. NetMHCpan-4.0: improved peptide–MHC classI interaction predictions integrating eluted ligand and peptide bind-ing afﬁnity data.

The Journal of Immunology

New England Journal of Medicine bioRxiv .Klinger, M.; Pepin, F.; Wilkins, J.; Asbury, T.; Wittkop, T.; Zheng,J.; Moorhead, M.; and Faham, M. 2015. Multiplex identiﬁcationof antigen-speciﬁc T cell receptors using a combination of im-mune assays and immune receptor sequencing.

PLoS One

Discrete applied mathematics

Proceedings 42ndIEEE Symposium on Foundations of Computer Science , 522–528.IEEE.Lee, C. H.-J.; and Koohy, H. 2020. In silico identiﬁcation of vac-cine targets for 2019-nCoV.

F1000Research

Vaccines

Cell Systems

Cell Sys-tems

Proceedings of the 34th International Con-ference on Machine Learning-Volume 70 , 2449–2458. JMLR. org.Mitra, D.; Shekhar, N.; Pandey, J.; Jain, A.; and Swaroop, S. 2020.Multi-epitope based peptide vaccine design against SARS-CoV-2using its spike protein. bioRxiv . Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An anal-ysis of approximations for maximizing submodular set functions-I.

Mathematical programming β ) sequences and binding as-sociations from natural and synthetic exposure to SARS-CoV-2. Research Square .O’Donnell, T. J.; Rubinsteyn, A.; and Laserson, U. 2020.MHCﬂurry 2.0: Improved Pan-Allele Prediction of MHC Class I-Presented Peptides by Incorporating Antigen Processing.

Cell Sys-tems

Nature bioRxiv .Ramaiah, A.; and Arumugaswami, V. 2020. Insights into cross-species evolution of novel human coronavirus 2019-nCoV anddeﬁning immune determinants for vaccine development. bioRxiv .Rist, M. J.; Theodossis, A.; Croft, N. P.; Neller, M. A.; Welland,A.; Chen, Z.; Sullivan, L. C.; Burrows, J. M.; Miles, J. J.; Brennan,R. M.; et al. 2013. HLA peptide length preferences control CD8+T cell responses.

The Journal of Immunology

Frontiers in immunology bioRxiv .Schultheiß, C.; Paschold, L.; Simnica, D.; and et al. 2020. Next-Generation Sequencing of T and B Cell Receptor Repertoires fromCOVID-19 Patients Showed Signatures Associated with Severityof Disease.

Immunity

Cell bioRxiv .Snyder, T. M.; Gittelman, R. M.; Klinger, M.; and et al. 2020. Mag-nitude and Dynamics of the T-Cell Response to SARS-CoV-2 In-fection at Both Individual and Population Levels. medRxiv .Srinivasan, A. 1999. Improved approximation guarantees for pack-ing and covering integer programs.

SIAM Journal on Computing bioRxiv .Tahir ul Qamar, M.; Rehman, A.; Ashfaq, U. A.; Awan, M. Q.;Fatima, I.; Shahid, F.; and Chen, L.-L. 2020. Designing of a nextgeneration multiepitope based vaccine (MEV) against SARS-COV-2: Immunoinformatics and in silico approaches. bioRxiv .ashi, Y.; Jagrit, V.; and Kumar, S. 2020. Understanding the B andT cells epitopes of spike protein of severe respiratory syndromecoronavirus-2: A computational way to predict the immunogens. bioRxiv .Wang, Q.; Qiu, Y.; Li, J.-Y.; Zhou, Z.-J.; Liao, C.-H.; and Ge, X.-Y.2020. A unique protease cleavage site predicted in the spike pro-tein of the novel pneumonia coronavirus (2019-nCoV) potentiallyrelated to viral transmissibility.

Virologica Sinica

Nature Chemical Biology

Cell Systems upplementary Material:Maximum n -times Coverage for COVID-19 Vaccine DesignDetails of Candidate Peptide Filtering For our SARS-CoV-2 peptide vaccine design, we eliminate peptides that are expected to mutate and thus cause vaccine escape,peptides crossing cleavage sites, peptides that may be glycosylated, and peptides that are identical to peptides in the humanproteome.

Removal of mutable peptides.

We eliminate peptides that are observed to mutate above an input threshold rate (0.001) toimprove coverage over all SARS-CoV-2 variants and reduce the chance that the virus will mutate and escape vaccine-inducedimmunity in the future. When possible, we select peptides that are observed to be perfectly conserved across all observedSARS-CoV-2 viral genomes. Peptides that are observed to be perfectly conserved in thousands of examples may be functionallyconstrained to evolve slowly or not at all.For SARS-CoV-2, we obtained the most up to date version of the GISAID database (Elbe and Buckland-Merrett 2017)(as of 2:02pm EST May 13, 2020, acknowledgements in Supplementary table) and used Nextstrain (Hadﬁeld et al. 2018) to remove genomes with sequencing errors, translate the genome into proteins, and perform multiple sequence alignments(MSAs). We retrieved 24468 sequences from GISAID, and 19288 remained after Nextstrain quality processing. After qualityprocessing, Nextstrain randomly sampled 34 genomes from every geographic region and month to produce a representativeset of 5142 genomes for evolutionary analysis. Nextstrain deﬁnition of a “region” can vary from a city (e.g., “Shanghai”) toa larger geographical district. Spatial and temporal sampling in Nextstrain is designed to provide a representative sampling ofsequences around the world.The 5142 genomes sampled by Nextstrain were then translated into protein sequences and aligned. We eliminated viralgenome sequences that had a stop codon, a gap, an unknown amino acid (because of an uncalled nucleotide in the codon), orhad a gene that lacked a starting methionine, except for ORF1b which does not begin with a methionine. This left a total of4690 sequences that were used to compute peptide level mutation probabilities. For each peptide, the probability of mutationwas computed as the number of non-reference peptide sequences observed divided by the total number of peptide sequencesobserved. Removal of cleavage regions.

SARS-CoV-2 contains a number of post-translation cleavage sites in ORF1a and ORF1b thatresult in a number of nonstructural protein products. Cleavage sites for ORF1a and ORF1b were obtained from UniProt (Con-sortium 2019) under entry P0DTD1. In addition, a furin-like cleavage site has been identiﬁed in the Spike protein (Wang et al.2020; Coutard et al. 2020). This cleavage occurs before peptides are loaded in the endoplasmic reticulum for class I or en-dosomes for class II. Any peptide that spans any of these cleavage sites is removed from consideration. This removes 3,739peptides out of the 154,996 we consider across windows 8-10 (class I) and 13-25 (class II) ( ∼ Removal of glycosylated peptides.

Glycosylation is a post-translational modiﬁcation that involves the covalent attachmentof carbohydrates to speciﬁc motifs on the surface of the protein. We eliminate all peptides that are predicted to have N-linkedglycosylation as it can inhibit MHC class I peptide loading and T cell recognition of peptides (Wolfert and Boons 2013). Weidentiﬁed peptides that may be glycosylated with the NetNGlyc N-glycosylation prediction server (Gupta, Jung, and Brunak2004) and eliminated peptides where NetNGlyc predicted a non-zero N-glycosylation probability in any residue. This resultedin the elimination of 18,957 of the 154,996 peptides considered ( ∼ Self-epitope removal.

T cells are selected to ignore peptides derived from the normal human proteome, and thus we removeany self peptides from consideration for a vaccine. In addition, it is possible that a vaccine might stimulate the adaptive immunesystem to react to a self peptide that was presented at an abnormally high level, which could lead to an autoimmune disorder.All peptides from SARS-CoV-2 were scanned against the entire human proteome downloaded from UniProt (Consortium 2019)under Proteome ID UP000005640. A total of 48 exact peptide matches (46 8-mers, two 9-mers) were discovered and eliminatedfrom consideration.

Detailed OptiVax-Robust Vaccine Design Results

Table S1 shows the evaluation of our OptiVax-Robust vaccine designs using the M

ARGINAL G REEDY algorithm compared to29 designs by others as baselines (Lee and Koohy 2020; Fast, Altman, and Chen 2020; Poran et al. 2020; Bhattacharya et al.2020; Baruah and Bose 2020; Abdelmageed et al. 2020; Ahmed, Quadeer, and McKay 2020; Srivastava et al. 2020; Herst et al.2020; Vashi, Jagrit, and Kumar 2020; Akhand, Azim, and et al. 2020; Mitra et al. 2020; Khan et al. 2020; Banerjee, Santra, andMaiti 2020; Ramaiah and Arumugaswami 2020; Gupta, Mishra, and Niraj 2020; Saha and Prasad 2020; Tahir ul Qamar et al. from GitHub commit 639c63f25e0bf30c900f8d3d937de4063d96f791 Evaluation of OptiVax-Robust Vaccine Design and baselines with non-experimentally-calibratedmachine learning predictions

We found that M

ARGINAL G REEDY produced vaccine designs superior to the baselines we tested even when it used immuno-genicity models that did not rely upon clinical data of peptide immunogenicity. Since our immunogenicity model used experi-mental data that was not accessible to the baseline methods, our designs have an advantage over baseline vaccines that did notuse calibrated machine learning predictions. We repeated vaccine design using an ensemble of NetMHCpan-4.0 and MHCﬂurrywith 50 nM predicted afﬁnity cutoff for predicting MHC class I immunogenicity and NetMHCIIpan-4.0 for MHC class II. Asshown in Figure S1, OptiVax-Robust again shows superior performance on all metrics at all vaccine size under 35, indicatingthe success of M

ARGINAL G REEDY in optimizing population coverage with diverse peptide display.Figure S1: EvalVax population coverage evaluation of SARS-CoV-2 vaccines for (A) MHC class I and (B) MHC class II usingnon-experimentally calibrated machine learning predictions. (a) EvalVax-Robust population coverage with n ≥ peptide-HLAhits per individual, OptiVax-Robust performance is shown by the blue curve and baseline performance is shown by red crosses(labeled by name of ﬁrst author). (b) EvalVax-Robust population coverage with n ≥ peptide-HLA hits. (c) EvalVax-Robustpopulation coverage with n ≥ peptide-HLA hits. (d) Comparison of OptiVax-Robust and baselines on expected number ofpeptide-HLA hits. eptide Set VaccineSize EvalVax-Robust p ( n ≥ EvalVax-Robust p ( n ≥ EvalVax-Robust p ( n ≥ Exp.

MHC Class I Peptide Vaccine Evaluation

OptiVax-Robust (Ours)

18 100 . % . % . % .

428 12 .

532 11 .

425 13 . S-protein . % . % . % .

183 38 .

212 38 .

250 44 . S1-subunit . % . % . % .

788 20 .

616 20 .

397 21 . Srivastava et al. (2020)

37 99 . % . % . % .

163 7 .

533 7 .

018 6 . Herst et al. (2020)

52 97 . % . % . % .

719 5 .

207 4 .

473 4 . Random subset of binders

18 97 . % . % . % .

851 7 .

007 6 .

028 7 . Herst et al. (2020)-top16

16 96 . % . % . % .

931 3 .

236 2 .

988 2 . Baruah and Bose (2020) . % . % . % .

565 3 .

128 2 .

031 2 . Fast, Altman, and Chen (2020)

13 89 . % . % . % .

558 5 .

616 4 .

386 6 . Vashi, Jagrit, and Kumar (2020)

51 84 . % . % . % .

802 1 .

976 1 .

711 1 . Abdelmageed et al. (2020)

10 79 . % . % . % .

000 2 .

192 1 .

777 2 . Lee and Koohy (2020)

13 75 . % . % . % .

113 4 .

397 3 .

384 4 . Ahmed, Quadeer, and McKay (2020)

16 73 . % . % . % .

085 3 .

051 2 .

248 3 . Poran et al. (2020)

10 72 . % . % . % .

355 0 .

997 1 .

383 1 . Singh et al. (2020) . % . % . % .

499 1 .

431 1 .

305 1 . Akhand, Azim, and et al. (2020)

31 61 . % . % . % .

840 1 .

062 0 .

774 0 . Bhattacharya et al. (2020)

13 53 . % . % . % .

710 0 .

977 0 .

635 0 . Khan et al. (2020) . % . % . % .

614 0 .

702 0 .

878 0 . Saha and Prasad (2020) . % . % . % .

423 0 .

496 0 .

291 0 . Gupta, Mishra, and Niraj (2020) . % . % . % .

804 0 .

747 0 .

388 1 . Mitra et al. (2020) . % . % . % .

355 0 .

504 0 .

251 0 . MHC Class II Peptide Vaccine Evaluation

OptiVax-Robust (Ours)

18 99 . % . % . % .

834 17 .

206 16 .

085 11 . S-protein . % . % . % .

096 541 .

429 416 .

406 227 . Ramaiah and Arumugaswami (2020)

134 98 . % . % . % .

743 45 .

044 38 .

254 17 . S1-subunit . % . % . % .

838 265 .

183 175 .

615 92 . Random subset of binders

18 93 . % . % . % .

805 7 .

461 6 .

367 3 . Fast, Altman, and Chen (2020)

13 86 . % . % . % .

560 3 .

650 2 .

262 1 . Banerjee, Santra, and Maiti (2020) . % . % . % .

398 3 .

162 2 .

354 1 . Tahir ul Qamar et al. (2020)

11 72 . % . % . % .

278 1 .

840 1 .

464 0 . Poran et al. (2020)

10 69 . % . % . % .

983 1 .

469 0 .

910 0 . Akhand, Azim, and et al. (2020)

31 60 . % . % . % .

886 2 .

531 2 .

536 0 . Singh et al. (2020) . % . % . % .

981 1 .

438 1 .

113 0 . Ahmed, Quadeer, and McKay (2020) . % . % . % .

654 0 .

736 0 .

717 0 . Mitra et al. (2020) . % . % . % .

657 0 .

905 0 .

579 0 . Vashi, Jagrit, and Kumar (2020)

20 35 . % . % . % .

673 0 .

959 0 .

618 0 . Abdelmageed et al. (2020)

10 28 . % . % . % .

479 0 .

919 0 .

274 0 . Baruah and Bose (2020) . % . % . % .

000 0 .

000 0 . Table S1: Comparison of existing baselines, S-protein peptides, and OptiVax-Robust peptide vaccine designs on various popu-lation coverage evaluation metrics. The rows are sorted by EvalVax-Robust p ( n ≥ . Random subsets are generated 200 times.The binders used for generating random subsets are deﬁned as peptides that are predicted to bind with ≤≤