HiCOPS: High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry based Omics Data
Muhammad Haseeb∗, Fahad Saeed∗

Abstract
Database-search algorithms, which deduce peptides from Mass Spectrometry (MS) data, have tried to improve the computational efficiency to accomplish larger and more complex systems biology studies. Existing serial and high-performance computing (HPC) search engines, otherwise highly successful, are known to exhibit poor scalability with the increasing size of the theoretical search-space needed for the increased complexity of modern non-model, multi-species MS-based omics analysis. Consequently, the bottleneck for computational techniques is the communication cost of moving the data between the memory hierarchy or processing units, and not the arithmetic operations. This post-Moore change in architecture, and the demands of modern systems biology experiments, have dampened the overall effectiveness of the existing HPC workflows. We present a novel, efficient parallel computational method, and its implementation on memory-distributed architectures, for a peptide identification tool called HiCOPS, which enables more than 100-fold improvement in speed over most existing HPC proteome database search tools. HiCOPS empowers the supercomputing database search concept for comprehensive identification of peptides, and all their modified forms, within a reasonable time-frame. We demonstrate this by searching gigabytes of experimental MS data against terabytes of databases, where HiCOPS completes peptide identification in a few minutes using 72 parallel nodes (1728 cores), compared to several weeks required by existing state-of-the-art tools using 1 node (24 cores); 100 minutes vs 5 weeks; a 500× speedup. Finally, we formulate a theoretical framework for our overhead-avoiding strategy, and report superior performance evaluation results for key metrics including execution time, CPU utilization, speedups, and I/O efficiency. We also demonstrate superior performance as compared to all existing HPC strategies.

Keywords— mass spectrometry, proteomics, peptide identification, bulk synchronous parallel, high performance computing

∗ School of Computing and Information Sciences, Florida International University (FIU), Miami, FL USA; Corresponding Author. Email: fsaeed@fiu.edu

Main

Faster and more efficient peptide identification algorithms [1] [2] [3] have been the cornerstone of computational research in shotgun MS based proteomics for more than 30 years [4, 5, 6, 7, 3, 8, 9, 10, 2, 11, 12, 13, 14, 15, 16, 17]. Millions of raw, noisy spectra can be produced in a span of a few hours using modern mass spectrometry technologies, producing several gigabytes of data [18] (Supplementary Fig. 1). Database peptide search is the most commonly employed computational approach to identify the peptides from the experimental spectra [19], [10], [2], [20]. In this approach, the experimental spectra are searched against a database of model-spectra (or theoretical-spectra) with the goal of finding the best possible matches [1]. The model-spectra database is simulated through in-silico techniques using a proteome sequence database (Supplementary Fig. 2). The model-spectra database can grow exponentially in space (several giga- to terabytes) as the post-translational modifications (PTMs) are incorporated in the simulation [2], [21].
Therefore, the cost of moving and managing this data to match with the spectra now exceeds the cost of doing the arithmetic operations in these search engines, leading to non-scalable workflows with increasingly larger and more complex data sets [22]. As demonstrated by other big data fields [23], such limitations can be reduced by developing parallel algorithms that combine the computational power of thousands of processing elements across distributed-memory clusters and supercomputers. We, and others, have developed high-performance computing (HPC) techniques for processing of MS data, including for multicore [3], [2], [10], [9], and distributed-memory architectures [24], [25], [26], [27], [28], [29]. Similar to serial algorithms, the objective of these HPC methods has been to speed up the arithmetic scoring part of the search engines by spawning multiple (managed) instances of the original code, replicating the theoretical database, and splitting the experimental data. However, computationally optimal HPC algorithms that minimize both the computational and communication costs for these tasks are still needed. The urgent need for developing methods that exhibit optimal performance is illustrated in our theoretical framework [22], and such methods can potentially lead to large-scale systems biology studies, especially for meta-proteomics, proteogenomics, and MS based microbiome or non-model organisms' studies, having direct impact on personalized nutrition, microbiome research, and cancer therapeutics. In our quest to develop faster strategies applicable to MS based omics data analysis, we designed a novel HPC framework that provides orders-of-magnitude faster processing over both serial and parallel tools. We implemented this framework in a new HPC tool, capable of scaling on large (distributed) symmetric multiprocessor (SMP) supercomputers, called
HiCOPS. HiCOPS makes searches possible (in a few minutes) even for terabyte-level theoretical databases; something not feasible (several weeks of computations) with existing state-of-the-art methods. We demonstrate HiCOPS's utility in both closed- and open-searches across different search parameters and experimental conditions. Further, our experimental results depict more than 100× speedup for HiCOPS compared to several existing shared and distributed memory database peptide search tools. HiCOPS is an overhead-avoiding strategy that splits the database (algorithmic workload) among the parallel processes in a load balanced fashion, executes the partial database peptide search, and merges the results in a communication-optimal way, thereby alleviating the resource upper bounds that exist in the current generation of database peptide search tools. We demonstrate HiCOPS results on several data- and compute-intensive experimental conditions, including using a 4TB theoretical database against which millions of spectra were matched. We show that HiCOPS, even when using similar scoring functions, outperforms both parallel and serial methods. Although not a fair comparison, one of our experiments of searching 41GB of experimental spectra against a database of size 1.8TB ran in only 103.5 minutes using 72 parallel nodes, compared to MSFragger which took about 35.5 days to complete the same experiment on 1 node (494× slower). HiCOPS completed an open-search (dataset size: 8K spectra, database size: 93.5M spectra) in 144 seconds as compared to X!!Tandem (33 minutes) and SW-Tandem (4.2 hours), all using 64 parallel nodes; demonstrating that HiCOPS also out-performs existing parallel tools. We designed 12 different experiment sets and demonstrate the performance of our parallel computing framework. We extensively evaluated HiCOPS parallel performance using metrics such as parallel efficiency: 70-80%, load imbalance cost: ≤ 5%, and task scheduling related costs: ≤ 5%.

Results
HiCOPS constructs the parallel database peptide search algorithmic workflow (task-graph) using four Single Program Multiple Data (SPMD) Bulk Synchronous Parallel (BSP) [30] supersteps, where a set of processes ($p_i \in P$) execute the ($\phi$) supersteps in asynchronous parallel fashion and synchronize between them. As shown in Fig 1, HiCOPS allows searching of a partial theoretical database in parallel; something that has not been accomplished in the context of peptide database-search tools. These partial search-results are then merged using a communication-optimal technique. In the first superstep, the massive model-spectra database is partitioned across parallel processes in a load balanced fashion. In the second superstep, the experimental data are divided into batches and pre-processed if required. In the third superstep, the parallel processes execute a partial database peptide search on the pre-processed experimental data batches, producing intermediate results. In the final superstep, these intermediate results are de-serialized and assembled into complete (global) results. The statistical significance scores are computed (Online Methods, Fig. 1) using the global results. Fig. 2 gives an overview of the parallelization scheme, task-graph, and workload profile for each of the HiCOPS supersteps (Online Methods).

The total wall time ($T_H$) for executing the four supersteps is the sum of the superstep execution times, given as:

$$T_H = T_1 + T_2 + T_3 + T_4$$

where the execution time for a superstep ($j$) is the maximum time required by any parallel task ($p_i \in P$) to complete that superstep, given as:

$$T_j = \max(T_{j,p_1}, T_{j,p_2}, \ldots, T_{j,p_P})$$

or simply: $T_j = \max_{p_i}(T_{j,p_i})$. Combining the above three equations, the total HiCOPS runtime is given as:

$$T_H = \sum_{j=1}^{4} \max_{p_i}(T_{j,p_i}) \quad (1)$$
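To make Eq. (1) concrete, the following sketch (a minimal stand-in, not the HiCOPS implementation) times four stub supersteps with MPI and reduces the per-process superstep times with a max operation before summing; the stub body and printed output are our own assumptions.

#include <mpi.h>
#include <cstdio>

// Eq. (1): each superstep's time is the max across processes (MPI_MAX
// reduction), and the total wall time T_H is the sum of the four maxima.
static void superstep(int j) { /* partition, pre-process, search, or merge */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    double T_H = 0.0;
    for (int j = 1; j <= 4; ++j) {
        MPI_Barrier(MPI_COMM_WORLD);           // superstep synchronization
        double t0 = MPI_Wtime();
        superstep(j);                          // asynchronous parallel work
        double local = MPI_Wtime() - t0;       // T_{j,p_i}
        double T_j = 0.0;
        MPI_Allreduce(&local, &T_j, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        T_H += T_j;                            // T_H = sum_j max_{p_i} T_{j,p_i}
    }
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) std::printf("T_H = %.3f s\n", T_H);
    MPI_Finalize();
    return 0;
}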
Figure 1: (a) Superstep 1: The massive model-spectra database (shown as shapes) is partitioned among parallel MPI processes in a load balanced manner and then locally indexed. (b) Superstep 2: The experimental MS/MS spectra data are split, indexed, tagged, pre-processed and written back to the file system in parallel. (c) Superstep 3: The partial database peptide search pipeline executed by all parallel processes is shown. On each process, three parallel sub-tasks R, I and K work in a producer-consumer pipeline to load the pre-processed data, execute the partial database search producing partial results, and write the (sampled) results to the shared memory, respectively. The available threads are managed between parallel sub-tasks through a task scheduling algorithm. The sub-tasks communicate via buffer queues to avoid fragmentation. (d) Superstep 4: The partial results are assembled into complete results to compute statistical scores, which are communicated to their origin processes.
Experimental Setup
We used the following datasets from the PRIDE Archive for experimentation and evaluation purposes:

• $E_1$: PXD009072 (0.305 million spectra)
• $E_2$: PXD020590 (1.6 million spectra)
• $E_3$: PXD015890 (3.8 million spectra)
• $E_4$: PXD007871, 009072, 010023, 012463, 013074, 013332, 014802, and 015391 combined (1.515 million spectra)
• $E_5$: All above datasets combined (6.92 million spectra)

Workload Profile: Supersteps 1 and 2 are designed as data parallel. Supersteps 3 and 4 are designed as hybrid task and data parallel. The workloads executed by the four respective supersteps are compute intensive, I/O intensive, mixed (compute and I/O), and mixed (compute and communication). In the last two supersteps, the compute workload may supersede the communication and/or I/O, given that the associated overhead costs are overlapped or minimized.

The search experiments were conducted against the following protein sequence databases. The databases were digested in-silico using Trypsin as the enzyme with 2 allowed missed cleavages, peptide lengths between 6 and 46, and peptide masses between 500 and 5000 Da. The number and type of PTMs added to the database and the peptide precursor mass tolerance ($\delta M$) were varied across experiments; however, the fragment mass tolerance ($dF$) was kept fixed.

• $D_1$: UniProt Homo sapiens (UP005640)
• $D_2$: UniProt SwissProt (reviewed, multi-species)

Furthermore, we designed 12 different experiments ($e_n$) using combinations of the above mentioned databases, datasets and experimental parameters for our extensive performance evaluation. These experiments exhibit varying experimental workloads to cover a wide range of real-world scenarios. We represent each of these experiment sets using a tuple: $e_n = (q, D, \delta M)$ where $q$ is the dataset size in million spectra, $D$ is the model-spectra database size in 100 million spectra, and $\delta M$ is the peptide precursor mass tolerance setting in ±100 Da: $e_1$=(0.3, 0.84, 0.1), $e_2$=(0.3, 0.84, 2), $e_3$=(3.89, 0.07, 5), $e_4$=(1.51, 2.13, 5), $e_5$=(6.1, 0.93, 5), $e_6$=(3.89, 7.66, 5), $e_7$=(1.51, 19.54, 5), $e_8$=(1.6, 38.89, 5), $e_9$=(3.89, 15.85, 5), $e_{10}$=(3.89, 1.08, 5), $e_{11}$=(1.58, 2.13, 1), and $e_{12}$=(0.305, 0.847, 5).
Runtime Environment: All distributed memory tools were run on the Extreme Science and Engineering Discovery Environment (XSEDE) [31] Comet cluster at the San Diego Supercomputer Center (SDSC). All Comet compute nodes are equipped with 2 sockets × 12 cores of Intel Xeon E5-2680v3 processors and 2 NUMA nodes × 64GB DDR4 memory. Experiments requiring > 48 hours (the job time limit on XSEDE Comet) could not be run to completion.
Correctness of the parallel design
We evaluated the correctness of our parallel design by searching all five datasets $E_i$ against both protein sequence databases $D_i$ under various settings and combinations of PTMs. The correctness was evaluated in terms of consistency in the number of database hits, the identified peptide to spectrum matches (PSMs), and the hyper-scores and e-values assigned to those sequences (within 3 decimal points) for each experimental spectrum searched. The experiments were performed using combinations of experimental settings where we observed more than 99.5% consistent results regardless of the number of parallel nodes. The negative error in expected values observed in erroneous identifications was caused by the sampling and floating-point precision losses (Online Methods, Fig. 5, Fig. 1d). A snippet of the 251,501 peptide to spectrum match (PSM) results, obtained by searching the dataset $E$ against the database $D$ with no post-translational modifications added at a fixed precursor mass tolerance ($\delta M$), illustrates this consistency.

Comparative analysis reveals orders of magnitude speedups
We compared the HiCOPS speed against many existing shared and distributed memory parallel database peptide search algorithms, including MSFragger v3.0 [2], X!Tandem v17.2.1 [32], Tide/Crux v3.2 [3], X!!Tandem v10.12.1 [25], and SW-Tandem [28]. In the first experiment set, a subset of 8,000 spectra (file: 7Sep18_Olson_WT24) from dataset $E$ was searched against the database $D$. Fixed Cysteine Carbamidomethylation, and variable Methionine Oxidation and Tyrosine Biotin-tyramide were added, yielding a model-spectra database of 93.5 million spectra. In the second experiment set, the complete dataset $E$ was searched against the same database $D$. The peptide precursor mass tolerance ($\delta M$) in both sets was varied from narrow (closed-search) to wide (open-search) settings. Our results show that HiCOPS achieves speedups of more than 100× when the experiment size is large. For instance, for the second experiment, HiCOPS (230 seconds) outperforms both X!!Tandem and SW-Tandem by a wide margin, and runs 350× faster than MSFragger (1 node) for the same experiment set. Furthermore, we observed no speedups for SW-Tandem with an increasing number of parallel nodes (no parallel efficiency). We repeatedly contacted the corresponding authors about the parallel efficiency issue but did not receive a response (Supplementary Text 6).

Application in large-scale peptide identification
The application of HiCOPS in extremely resource-intensive experimental settings was demonstrated using additional experiments where the datasets $E_3$, $E_4$ and $E_2$ were searched against model-spectra databases of sizes 766M (780GB), 1.59B (1.7TB) and 3.88B (4TB) respectively ($\delta M$ = ±500 Da). We also ran one of these experiments (dataset: $E_4$, database size: 1.59B (1.7TB)) on MSFragger, which completed after 35.5 days, making HiCOPS 494× faster. The rest of the experiments were intentionally not run using any tools other than HiCOPS to avoid feasibility issues, as each tool would require several months of processing to complete each experiment, as evident from Supplementary Tables 3, 4, 5. The wall clock execution time results for this set of experiments are summarized in Table 1.

Table 1: Experimental wall times for large-scale peptide identification experiments using HiCOPS and MSFragger. Exp. 1: (DB: 766M, DS: 3.8M, $\delta M$ = 500Da), Exp. 2: (DB: 1.6B, DS: 1.5M, $\delta M$ = 500Da), Exp. 3: (DB: 3.88B, DS: 1.6M, $\delta M$ = 500Da). The experiments were not performed using other tools due to their relatively slower speeds requiring several months of processing per tool per experiment.

Exp.Num | Tool | Nodes | Dataset(GB) | Database(GB) | Time(min)
HiCOPS exhibits efficient strong-scale speedups
The speedup and strong-scale efficiency for the overall and superstep-by-superstep runtimes were measured for all 12 experiment sets. The results (Supplementary Fig. 4, Fig. 3a, b) depict that the overall strong-scale efficiency closely follows superstep 3 (evident in Supplementary Fig. 4) and ranges between 70-80% for sufficiently large experimental workloads. Super-linear speedups were also observed in many experiments with higher workloads. To explain this, the following hardware counter-based metrics were also recorded for all experiment sets: instructions per cycle (ipc), last level cache misses per all cache level misses (lpc), and cycles stalled due to writes per total stalled cycles (wps). The results (Fig. 3c) show that the CPU, cache, and memory bandwidth utilization improves as the workload per node (wf/P) increases, reaching an optimum point after which it saturates due to memory bandwidth contention, since the database search algorithms employed (and such algorithms in general) are highly memory intensive. Beyond this saturation point, increasing the number of parallel nodes for the same experimental workload resulted in a substantial (super-linear) improvement in performance as the workload per node (wf/P) reduced to the normal (optimal) range. For instance, one of the experiment sets depicts super-linear speedups (Fig. 3a), which can be correlated to the hardware performance surge in Fig. 3c.

Figure 3: (a, b) The speedup and parallel efficiency improve as the experimental workload increases, ranging between 70-80% for most experiment sets. (c) The hardware utilization metrics show an improved performance-per-node trend for large workloads as the number of parallel nodes increases, resulting in super-linear speedups (e.g., the super-linear sets in (a)).

Performance evaluation reveals minimal overhead costs
The load imbalance, communication, I/O, and task scheduling costs were measured for all 12 experiment sets. The obtained results (Fig. 4a, b, c) depict that the load imbalance costs remain ≤ 10%, the communication costs ≤ 5%, and the I/O costs ≤ 10% in most experiments. Note that the load imbalance is a direct measure of synchronization cost. The task-scheduling cost was measured through a time series ($t_{wait}$) (Fig. 4e) which monitors the time that the parallel cores had to wait for the data I/O to complete. The results (Fig. 4e) depict that our task-scheduling algorithm actively performs counter measures (reallocates threads) as soon as a surge in wait-time is detected, keeping the cost to ≤ 5% in most experiments (Fig. 4d). We also observed that the I/O cost is affected by a number of factors, including the average dataset file size, the number of files in the dataset, and the available file system bandwidth. The communication cost is affected by the available network bandwidth.
Discussion
Enormous possibilities of chemical and biological modifications add knowledge-discovery dimensions to mass spectrometry-based omics but are not explored in most studies, in part due to the scalability challenges associated with comprehensive PTM searches. Current MS based computational proteomics algorithms, both serial and parallel, have focused on improving arithmetic computations by introducing indexing and approximation methods to speed up their workflows. However, recent trends in the workloads stemming from systems biology (e.g. meta-proteomics, proteogenomics) experiments point towards an urgent need for computational tools capable of efficiently harnessing the compute and memory resources of supercomputers. Our highly scalable and low-overhead strategy, HiCOPS, meets this urgent need for the next generation of computational solutions, leading to more comprehensive peptide identification applications. Further, our HPC framework can be adapted for accelerating most existing modern database peptide search algorithms.

We demonstrate, using experiments at an unprecedented scale, peptide deduction through searching gigabytes of experimental MS/MS data against terabytes of model-spectra databases in only a few minutes, compared to several days required by modern tools (100 minutes vs 5 weeks; 500× speedup using 72 parallel nodes). Our overhead-avoiding BSP-model based parallel algorithmic design allows efficient exploitation of the extreme-scale resources available in modern high-performance computing architectures and supercomputers. Extensive performance evaluation using over two dozen experiments with variable problem sizes (database and dataset sizes) and experimental settings revealed superior strong-scale parallel efficiency and minimal overhead costs for HiCOPS. HiCOPS is a novel HPC framework that gives systems biologists a tool to perform comprehensive modification searches for meta-proteomics, proteogenomics, and proteomics studies of non-model organisms at scale. HiCOPS is under active development; future versions will bring improved I/O efficiency, load balancing, reduced overhead costs, and a parallel design for heterogeneous CPU-GPU architectures. Therefore, we believe that the peptide search strategy (both open and closed) for comprehensive PTMs, made practical by HiCOPS, has the potential to become a valuable option for scalable analysis of shotgun Mass Spectrometry based omics.

Figure 4: (a) The load imbalance overhead costs remain under 10% in most experiment sets. (b) The communication overhead costs remain under 5% in most experiment sets. (c) The I/O overheads remain under 10% in most experiment sets, but there is an upward trend as the number of parallel nodes increases. This occurs due to the saturation of the shared file system bandwidth. (d) The scheduling costs remain under 5% for most experiment sets. The scheduling costs may increase if the workload per node is extremely small. (e) The time series shows that the task scheduling algorithm efficiently redistributes the parallel threads as soon as a surge in cost is detected.

Online Methods

Notations and Symbols
For the rest of the paper, we will denote the number of peptide sequences in the database as ($\zeta$), the average number of post-translational modifications (PTMs) per peptide sequence as ($m$), the total database size as ($D = \zeta\,2^m$), the number of parallel nodes/processes as ($P$), the number of cores per parallel process as ($c_{p_i}$), the size of the experimental MS/MS dataset (i.e. the number of experimental/query spectra) as ($q$), the average length of a query spectrum as ($\beta$), and the total dataset size as ($q\beta$). The runtime of executing the superstep ($j$) by a parallel task ($p_i$) will be denoted as ($T_{j,p_i}$), and the generic overheads due to boilerplate code, OS delays, memory allocation, etc. will be captured via ($\gamma_{p_i}$).

Runtime Cost Model
Since the HiCOPS parallel processes run in SPMD fashion, the cost analysis for any parallel process (with variable input size) is applicable to the entire system. Also, the runtime cost for a parallel process ($p_i \in P$) to execute a superstep ($j$) can be modeled by only its local input size (i.e. database and dataset sizes) and available resources (i.e. number of cores, memory bandwidth). The parallel processes may execute the algorithmic work in a data parallel, task parallel, or a hybrid task and data parallel model. As an example, the execution runtime (cost) for a parallel process $p_i$ to execute a superstep ($j$) which first generates $D$ model-spectra using algorithm $k_{j_1}$ and then sorts them using algorithm $k_{j_2}$ in data parallel fashion (using all $c_{p_i}$ cores) will be given as follows:

$$T_{j,p_i} = k_{j_1}(D) + k_{j_2}(D) + \gamma_{p_i} \quad (2)$$

Similarly, if the above steps $k_{j_z}$ are performed in a hybrid task and data parallel fashion, the number of cores allocated to each ($k_{j_z}$) must also be considered. For instance, in the above example, if the two algorithmic steps are executed in sub-task parallel fashion with $c_{p_i}/2$ cores each:

$$T_{j,p_i} = \max(k_{j_1}(D, c_{p_i}/2),\; k_{j_2}(D, c_{p_i}/2)) + \gamma_{p_i} \quad (3)$$

For analysis purposes, if the time complexity of the algorithm used for a step $k_{j_z}$ is known (say $O(.)$), we will convert it into a linear function $k'_{j_z}$ with its input data size multiplied by its runtime complexity. This conversion will allow better quantification of the serial and parallel runtime portions, as seen in later sections. As an example, if it is known that the sorting algorithm used for $k_{j_2}$ has time complexity $O(N \log N)$, equation 2 can be modified to:

$$T_{j,p_i} = k_{j_1}(D) + k'_{j_2}(D \log D) + \gamma_{p_i} \quad (4)$$
Remarks: The formulated model will be used to analyze the runtime cost for each superstep, quantify the serial, parallel and overhead costs in the overall design, and optimize the overheads.
Superstep 1: Partial Database Construction
In this superstep, the HiCOPS parallel processes construct a partial database by executing the following three algorithmic steps in data parallel fashion (Fig. 2):

1. Generate the whole peptide database and extract a (load balanced) partition.
2. Generate the model-spectra data from the local peptide database partition.
3. Index the local peptide and model-spectra databases (fragment-ion index).

The database entries are generated and partitioned through the LBE algorithm [33], supplemented with a new distance metric called Mod Distance ($\Delta_m$). The proposed $\Delta_m$ separates pairs of database entries based on the edit locations if they have the same Edit Distance ($\Delta_e$) (see Supplementary Text 3). The reason for supplementing LBE with the new distance metric is to construct identical (load balanced) database partitions [33] at the parallel HiCOPS processes. Supplementary Fig. 3 illustrates the generic LBE algorithm, which, to the best of our knowledge, is the only existing technique for efficient model-spectra database partitioning. For a pair of peptide database entries ($x, y$), assuming the sum of unedited letters from both sequence termini is ($a$), the Mod Distance ($\Delta_m$) is given as:

$$\Delta_m(x, y) = 2 - \frac{a}{\max(len(x), len(y))}$$
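A self-contained sketch of the Mod Distance computation is shown below; the '*'-suffix encoding of modified residues and the helper names are our assumptions, but the metric follows the definition above and reproduces the example values from Supplementary Text 3.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Parse "ME*GSYI*RK" into residue letters and per-residue modification flags.
static void parse(const std::string& pep, std::string& seq, std::vector<bool>& mod) {
    for (char c : pep) {
        if (c == '*') mod.back() = true;       // '*' marks the preceding residue
        else { seq.push_back(c); mod.push_back(false); }
    }
}

// Mod Distance: delta_m(x, y) = 2 - a / max(len(x), len(y)), where 'a' counts
// the unedited residues shared from both termini of the pair.
double modDistance(const std::string& x, const std::string& y) {
    std::string sx, sy; std::vector<bool> mx, my;
    parse(x, sx, mx); parse(y, sy, my);
    std::size_t n = std::min(sx.size(), sy.size());
    std::size_t front = 0;
    while (front < n && !mx[front] && !my[front] && sx[front] == sy[front]) ++front;
    std::size_t back = 0;
    while (back < n - front &&
           !mx[mx.size() - 1 - back] && !my[my.size() - 1 - back] &&
           sx[sx.size() - 1 - back] == sy[sy.size() - 1 - back]) ++back;
    double a = static_cast<double>(front + back);
    return 2.0 - a / std::max(sx.size(), sy.size());
}

int main() {
    // Example peptides p, q, r from Supplementary Text 3.
    std::cout << modDistance("MEGSYIRK", "ME*GSYI*RK") << "\n";  // 1.625
    std::cout << modDistance("MEGSYIRK", "MEGS*Y*IRK") << "\n";  // 1.25
    return 0;
}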
Cost Analysis: The first step generates the entire database of size ($D$) and extracts a partition (of roughly the size $D/P = D_{p_i}$) in runtime $k_1(D)$. The second step generates the model-spectra from the partitioned database using standard algorithms [12] (currently, only the b- and y-ions are generated) in runtime $k_2(D_{p_i})$. The third step constructs a fragment-ion index similar to [2], [34], [21] in time $O(N \log N)$. In our implementation, we employed the CFIR-Index [21] indexing method due to its smaller memory footprint. This results in time $k'_3(D_{p_i} \log D_{p_i})$ for the indexing step. Collectively, the runtime for this superstep is given by Equation 5:

$$T_1 = \max_{p_i}(k_1(D) + k_2(D_{p_i}) + k'_3(D_{p_i} \log D_{p_i}) + \gamma_{p_i}) \quad (5)$$
Remarks: Equation 5 depicts that the serial part of the execution time, i.e. $k_1(D)$, limits the parallel efficiency of superstep 1. However, using a simpler but faster database partitioning scheme may result in imbalanced partial databases, leading to severe performance degradation.

Superstep 2: Experimental MS/MS Data Pre-processing
In this superstep, the HiCOPS parallel processes pre-process a partition of the experimental MS/MS spectra data by executing the following three algorithmic steps in data parallel fashion (Fig. 2):

1. Read the dataset files, create a batch index, and initialize internal structures.
2. Pre-process (i.e. normalize, clear noise, reconstruct, etc.) a partition of the experimental MS/MS data.
3. Write back the pre-processed data.

The experimental spectra are split into batches such that a reasonable parallel granularity is achieved when these batches are searched against the database. By default, the maximum batch size is set to 10,000 spectra and the minimum number of batches per dataset is set to $P$. The batch information is indexed using a queue and a pointer stack to allow quick access to the pre-processed experimental data in superstep 3.
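The batching policy can be sketched as follows; only the 10,000-spectra cap and the minimum of $P$ batches come from the text, while the function name and the even remainder distribution are our own assumptions (a C++14 sketch):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Split q experimental spectra into batches: at most maxBatch spectra per
// batch, and at least P batches per dataset (one per parallel process).
std::vector<int64_t> makeBatches(int64_t q, int64_t P, int64_t maxBatch = 10000) {
    int64_t nBatches = std::max(P, (q + maxBatch - 1) / maxBatch);  // ceil(q / maxBatch)
    std::vector<int64_t> sizes(nBatches, q / nBatches);
    for (int64_t i = 0; i < q % nBatches; ++i) ++sizes[i];          // spread the remainder
    return sizes;
}

int main() {
    auto sizes = makeBatches(305000, 72);  // e.g., a 0.305M-spectra dataset on 72 nodes
    std::cout << sizes.size() << " batches, first batch = " << sizes[0] << " spectra\n";
    return 0;
}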
Cost Analysis: The first step reads the entire dataset (size: $q\beta$) and creates a batch index in runtime $k_4(q\beta)$. The second step may pre-process a partition of the dataset (of roughly the size $q\beta/P = Q_{p_i}$) using a data pre-processing algorithm such as [35], [5], [36], etc. in runtime $k_5(Q_{p_i})$. The third step may write the pre-processed data back to the file system in runtime $k_6(Q_{p_i})$. Note that the second and third steps may altogether be skipped in subsequent runs or in case the pre-processed spectra data are already available. Collectively, the runtime for this superstep is given by Equation 6:

$$T_2 = \max_{p_i}(k_4(q\beta) + k_5(Q_{p_i}) + k_6(Q_{p_i}) + \gamma_{p_i}) \quad (6)$$
Remarks: Equation 6 depicts that the parallel efficiency of superstep 2 is highly limited by its dominant serial portion, i.e. $k_4(q\beta)$. Moreover, this superstep is sensitive to the file system bandwidth, since large volumes of data may need to be read from and written to the shared file system.

Superstep 3: Partial Database Peptide Search
This is the most important superstep in the HiCOPS workflow and is responsible for 80-90% of the database peptide search algorithmic workload in real-world experiments. In this superstep, the HiCOPS parallel processes search the pre-processed experimental spectra against their partial databases by executing the following three algorithmic steps in a hybrid task and data parallel fashion (Fig. 2):

1. Load the pre-processed experimental MS/MS data batches into memory.
2. Search the loaded spectra batches against the (local) partial database and produce intermediate results.
3. Serialize and write the intermediate results to the shared file system, assigning them unique tags.

Three parallel sub-tasks are created, namely R, I and K, to execute the algorithmic work in this superstep in a producer-consumer pipeline. The data flow between parallel sub-tasks is handled through queues to create a buffer between producers and consumers. The first sub-task (R) loads batches of the pre-processed experimental spectra data and puts them in the queue ($q_f$) as depicted in Supplementary Algorithm 2. The sub-task R may also perform minimal computations on the experimental spectra before putting them in the queue, e.g. selecting only the ($B$) most-intense peaks from the experimental spectra. The number of parallel cores assigned to sub-task R is given by $|r|$. The second sub-task (I) extracts batches from ($q_f$), performs the database peptide search against its local database partition, and puts the produced intermediate results in the queue ($q_k$), as depicted in Supplementary Algorithm 3. The number of parallel cores assigned to sub-task I is given by $|i|$. The sub-task I also recycles the memory buffers back to sub-task R for reuse, using the queue ($q_r$). The last sub-task (K) serializes and writes the intermediate results to the shared file system (or shared memory if available) using $|k|$ cores. Fig. 1c illustrates the pipeline setup in this superstep.
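Below is a heavily simplified C++14 sketch of the R → I → K producer-consumer pipeline. Real HiCOPS recycles fixed buffer pools ($q_r$, $q'_k$) and runs multi-threaded sub-tasks under a scheduler; this sketch assumes one thread per sub-task, unbounded queues, and stub batch/result types.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A minimal thread-safe queue standing in for HiCOPS' q_f / q_k buffer queues.
template <typename T>
class Queue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(T v) { { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); } cv_.notify_one(); }
    bool pop(T& out) {                         // false once closed and drained
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        out = std::move(q_.front()); q_.pop(); return true;
    }
    void close() { { std::lock_guard<std::mutex> l(m_); done_ = true; } cv_.notify_all(); }
};

struct Batch  { int id; };   // stands in for a pre-processed spectra batch
struct Result { int id; };   // stands in for (sampled) partial search results

int main() {
    Queue<Batch>  qf;  // R -> I
    Queue<Result> qk;  // I -> K
    std::thread R([&] {                        // producer: load data batches
        for (int i = 0; i < 8; ++i) qf.push(Batch{i});
        qf.close();
    });
    std::thread I([&] {                        // consumer/producer: partial search
        Batch b;
        while (qf.pop(b)) qk.push(Result{b.id});
        qk.close();
    });
    std::thread K([&] {                        // consumer: serialize and write
        Result r;
        while (qk.pop(r)) std::cout << "wrote result batch " << r.id << "\n";
    });
    R.join(); I.join(); K.join();
    return 0;
}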
Cost Analysis: The sub-task (R) reads the experimental data batches in runtime $k_7(q\beta)$. The sub-task (I) iteratively filters the partial database using multiple criteria, followed by formal spectral comparisons (or scoring). Most commonly, the database peptide search algorithms use two or three database filtration steps, such as peptide precursor mass tolerance [3], [28], shared fragment-ions [2], [34], and sequence tags [10], in runtimes $k_8(qD_{p_i}) + k_9(q\beta\alpha_{p_i})$ respectively. Here, $\alpha_{p_i}$ represents the average database size filtered through the first step. The formal experimental-spectrum to model-spectra comparisons (spectral comparisons) are performed using scoring methods such as cross-correlation [12], hyperscore [13], etc. in runtime $k_{10}(q\beta\sigma_{p_i}) + k_{11}(q\mu_{p_i})$. Here, $\sigma_{p_i}$ and $\mu_{p_i}$ represent the average number of filtered shared-ions and model-spectra per experimental spectrum. Finally, the sub-task K writes the partial results to the shared file system in runtime $k_{12}(q)$.
Overhead Costs: Multiple runtime overheads, stemming from load imbalance, producer-consumer speed mismatch, and file system bandwidth congestion, can affect the performance of this superstep. Therefore, it is important to capture them using an additional runtime cost: $V_{p_i}(q, D_{p_i}, P)$. The optimizations implemented to alleviate these overhead costs in superstep 3 include buffering, task scheduling, load balancing and data sampling (discussed in later sections). Collectively, the runtime for this superstep is given by Equation 10.

The runtime of sub-task R, i.e. $t_{p_i}(r, |r|)$, is given as:

$$t_{p_i}(r, |r|) = k_7(q\beta, |r|) \quad (7)$$

The runtime of sub-task I, i.e. $t_{p_i}(i, |i|)$, is given as:

$$t_{p_i}(i, |i|) = k_8(qD_{p_i}, |i|) + k_9(q\beta\alpha_{p_i}, |i|) + k_{10}(q\beta\sigma_{p_i}, |i|) + k_{11}(q\mu_{p_i}, |i|)$$

or:

$$t_{p_i}(i, |i|) = k'_8(q \log(D_{p_i}), |i|) + k'_9(q\beta \log(\alpha_{p_i}), |i|) + k_{10}(q\beta\sigma_{p_i}, |i|) + k_{11}(q\mu_{p_i}, |i|) \quad (8)$$

The runtime of sub-task K, i.e. $t_{p_i}(k, |k|)$, is given as:

$$t_{p_i}(k, |k|) = k_{12}(q, |k|) \quad (9)$$

Combining equations 7, 8 and 9, we have:

$$T_3 = \max_{p_i}(\max(t_{p_i}(r, |r|),\; t_{p_i}(i, |i|),\; t_{p_i}(k, |k|)) + V_{p_i}(q, D_{p_i}, P) + \gamma_{p_i}) \quad (10)$$
Remarks: Equations 7, 8, 9 and 10 depict that the parallel runtime portion of this superstep grows quadratically, superseding the (small) serial portions, and is capable of near-ideal parallel efficiency if the overheads are eliminated.
Superstep 4: Result Assembly
In this superstep, the HiCOPS parallel processes assemble the intermediate results from the last superstep into complete results by executing the following algorithmic steps in a hybrid task and data parallel fashion (Fig. 2d):

1. Read a set of intermediate result batches, assemble them into complete results, and send the assembled results to their parent processes.
2. Receive complete results from other parallel processes and synchronize communication.
3. Write the complete results to the file system.

Two parallel sub-tasks are created to execute the algorithmic steps in this superstep. The first sub-task reads sets of intermediate results from the shared file system (or shared memory) (satisfying: $tag \,\%\, p_i = 0$; $p_i \in$ MPI ranks), de-serializes them, and assembles the complete results. The statistical significance scores are then computed and sent to their origin processes. For example, the process with MPI rank 4 will process all intermediate result batches with tags $8i$ where $i = 0, 1, \ldots, P - 1$.
The assembly process is done through signal addition and shift operations, illustrated in Fig. 1d. The expectation scores (e-values (ev)) are computed using a null-hypothesis approach by first smoothing the assembled data through a Savitzky-Golay filter and then applying a significance test through either the Linear Tail-Fit [37] or the log-Weibull (Gumbel) Fit method, illustrated in Fig. 1d. The computed e-values, along with additional information (16 bytes), are sent to the HiCOPS process that recorded the most significant database hit (the origin). The computed results are not sent immediately but are accumulated in a map data structure and sent collectively when all batches are done. All available cores ($c_{p_i}$) are assigned to this sub-task. Supplementary Algorithm 4 depicts the algorithmic work performed by this sub-task. The second sub-task waits for the $P - 1$ result messages from the other parallel processes; a sketch of the tail-fit step follows.
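A minimal sketch of the linear tail-fit step: it least-squares fits $\log_{10}$ of the null-distribution counts over the tail bins and extrapolates the expected number of random hits (e-value) at a query score. The function name, bin indexing, and the synthetic histogram are assumptions; the Savitzky-Golay smoothing and the log-Weibull alternative are omitted here.

#include <cmath>
#include <iostream>
#include <vector>

// Fit log10(count) vs. score over the tail of the assembled null distribution
// and extrapolate the e-value at 'score' (Linear Tail-Fit, in the spirit of [37]).
double evalueTailFit(const std::vector<double>& hist, std::size_t tailStart, double score) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0; int n = 0;
    for (std::size_t i = tailStart; i < hist.size(); ++i) {
        if (hist[i] <= 0) continue;              // log undefined on empty bins
        double x = static_cast<double>(i), y = std::log10(hist[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y; ++n;
    }
    if (n < 2) return 1.0;                       // not enough tail data to fit
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double icept = (sy - slope * sx) / n;
    return std::pow(10.0, slope * score + icept);  // extrapolated e-value
}

int main() {
    std::vector<double> hist(50, 0.0);
    for (int i = 0; i < 50; ++i) hist[i] = 1e4 * std::exp(-0.35 * i);  // synthetic null
    std::cout << "e-value = " << evalueTailFit(hist, 20, 45.0) << "\n";
    return 0;
}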
Cost Analysis: The first sub-task reads the intermediate results, performs regression, and sends the computed results to other processes in runtime $k_{13}(Q_{p_i}, c_{p_i}) + k_{14}(Q_{p_i}, c_{p_i}) + k_{15}(P, 1)$. The second sub-task receives complete results from other tasks in runtime $k_{16}(P, 1)$ and writes them to the file system in runtime $k_{17}(Q_{p_i})$. Collectively, the runtime for this superstep is given by Equation 11:

$$T_4 = \max_{p_i}(\max(k_{13}(Q_{p_i}, c_{p_i}) + k_{14}(Q_{p_i}, c_{p_i}) + k_{15}(P, 1),\; k_{16}(P, 1) + k_{17}(Q_{p_i})) + \gamma_{p_i}) \quad (11)$$

To simplify Equation 11, we can re-write it as a sum of the computation costs plus the communication overheads ($k_{com}(P, 1)$):

$$T_4 = \max_{p_i}(k_{13}(Q_{p_i}, c_{p_i}) + k_{14}(Q_{p_i}, c_{p_i}) + k_{com}(P, 1) + k_{17}(Q_{p_i}) + \gamma_{p_i})$$

Assuming that the network latency is denoted as ($\omega$), the bandwidth as ($\pi$), and ($16\,Q_{p_i}$) is the average data packet size in bytes, the inter-process communication overhead cost ($k_{com}(P, 1)$) is given as:

$$k_{com}(P, 1) \approx (P - 1)\,(\omega + 16\,Q_{p_i}/\pi) \quad (12)$$
Remarks: As the communications per process are limited to only one data exchange between any pair of processes, the overall runtime given by Equation 12 is highly scalable. The effective communication cost depends on the amount of overlap with computations and the network parameters at the time of the experiment.
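For intuition, a worked instance of Eq. (12) follows; the latency $\omega = 2\,\mu s$, bandwidth $\pi = 10$ GB/s, and $Q_{p_i} = 10^5$ results per process are assumed values (not measured), with $P = 72$ nodes taken from our largest runs:

\[
k_{com}(P, 1) \approx (P - 1)\left(\omega + \frac{16\,Q_{p_i}}{\pi}\right)
= 71\left(2 \times 10^{-6} + \frac{1.6 \times 10^{6}}{10^{10}}\right)\,\mathrm{s}
\approx 11.5\ \mathrm{ms}
\]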
Performance Analysis
To quantify the parallel performance, we decompose the total HiCOPS time $T_H$ (Eq. 1) into three runtime components, i.e. the parallel runtime ($T_p$), the serial runtime ($T_s$) and the overheads runtime ($T_o$), given as:

$$T_H = \sum_{j=1}^{4} \max_{p_i}(T_{j,p_i}) = T_o + T_s + T_p \quad (13)$$

Using equations 1, 5, 6, 10, and 12, we separate the three runtime components as:

$$T_o = V_{p_i}(q, D_{p_i}, P) + \gamma_{p_i} \quad (14)$$

$$T_s = k_1(D) + k_4(q\beta) + k_{com}(P, 1) \quad (15)$$

and:

$$T_p = k_2(D_{p_i}) + k'_3(D_{p_i} \log D_{p_i}) + k_5(Q_{p_i}) + k_6(Q_{p_i}) + \max(t_{p_i}(r, |r|),\; t_{p_i}(i, |i|),\; t_{p_i}(k, |k|)) + k_{13}(Q_{p_i}, c_{p_i}) + k_{14}(Q_{p_i}, c_{p_i}) + k_{17}(Q_{p_i}) \quad (16)$$

$T_s$ is the minimum serial time required for HiCOPS execution and cannot be further reduced. Therefore, we will focus on optimizing the remaining runtime: $T_F = T_p + T_o$. Using equations 14 and 16, we have:

$$T_F = k_2(D_{p_i}) + k'_3(D_{p_i} \log D_{p_i}) + k_5(Q_{p_i}) + k_6(Q_{p_i}) + \max(t_{p_i}(r, |r|),\; t_{p_i}(i, |i|),\; t_{p_i}(k, |k|)) + k_{13}(Q_{p_i}, c_{p_i}) + k_{14}(Q_{p_i}, c_{p_i}) + k_{17}(Q_{p_i}) + T_o \quad (17)$$

Since the HiCOPS parallel processes divide the database and the experimental dataset roughly fairly in supersteps 1 and 2, the first four terms in $T_F$ are already almost optimized, so we can prune them from $T_F$:

$$T_F = \max(t_{p_i}(r, |r|),\; t_{p_i}(i, |i|),\; t_{p_i}(k, |k|)) + k_{13}(Q_{p_i}, c_{p_i}) + k_{14}(Q_{p_i}, c_{p_i}) + k_{17}(Q_{p_i}) + T_o \quad (18)$$

Recall that the superstep 4 runtime is optimized for maximum parallelism (and least inter-process communication) and that superstep 3 performs the largest fraction of the overall algorithmic workload. Thus, we can also remove the superstep 4 terms from $T_F$ to simplify the analysis:

$$T_F = \max(t_{p_i}(r, |r|),\; t_{p_i}(i, |i|),\; t_{p_i}(k, |k|)) + T_o$$

Further, as superstep 3 is executed using the producer-consumer pipeline (Fig. 1c) where the sub-task R must produce all data before it can be consumed by I, its runtime must also be smaller than $t_{p_i}(i, |i|)$ and $t_{p_i}(k, |k|)$, allowing its safe removal from the above equation, yielding:

$$T_F = \max(t_{p_i}(i, |i|),\; t_{p_i}(k, |k|)) + T_o$$

In the above equation, we can rewrite the $\max(.)$ term as the time to complete sub-task I, i.e. $t_{p_i}(i, |i|)$, plus the extra time to complete sub-task K (the last consumer), $t_x(k)$. Therefore, using equation 8, we have:

$$T_F = k'_8(q \log(D_{p_i}), |i|) + k'_9(q\beta \log(\alpha_{p_i}), |i|) + k_{10}(q\beta\sigma_{p_i}, |i|) + k_{11}(q\mu_{p_i}, |i|) + t_x(k) + T_o \quad (19)$$

We can prune the first two terms in equation 19 as well, since their runtime contribution, $O(\log N)$, will be relatively very small. Finally, using equation 14 in 19, we have:

$$T_F = k_{10}(q\beta\sigma_{p_i}, |i|) + k_{11}(q\mu_{p_i}, |i|) + t_x(k) + V_{p_i}(q, D_{p_i}, P) + \gamma_{p_i} \quad (20)$$
Remarks: Equations 17-19 and the simplifications made may be modified according to changes in the superstep design and/or the algorithms employed in each superstep.
Optimizations
The overhead cost term $V_{p_i}(q, D_{p_i}, P)$ represents the load imbalance (or synchronization), producer-consumer speed mismatch, and data read costs, while the term $t_x(k)$ represents the data write cost. Note that these overheads may cause a large subset of processing cores to idle (wasted CPU cycles). Furthermore, note that the load imbalance cost encapsulates all other costs in itself. The following sections discuss the optimization techniques employed to alleviate these overhead costs.

Buffering
Four queues, the forward queue ($q_f$), the recycle queue ($q_r$), and the result queues ($q_k$, $q'_k$), are initialized and routed between the producer-consumer sub-tasks in superstep 3 (Fig. 1c) as: R → I, R ← I, I → K and I ← K respectively. The $q_r$ is initialized with (default: 20) empty buffers for the sub-task R to fill with pre-processed experimental data batches and push into $q_f$. The sub-task I removes a buffer from $q_f$, consumes (searches) it, and pushes it back to $q_r$ for re-use. The results are pushed to $q_k$, consumed by sub-task K, and pushed back to $q'_k$ for re-use. Three regions are defined for the queue $q_f$ based on the number of data buffers it contains at any time, i.e. $w_1$: ($q_f.len < 5$), $w_2$: ($5 \le q_f.len < 15$) and $w_3$: ($q_f.len \ge 15$). The current region ($w_l$) is used by the task-scheduling algorithm discussed in the following section, as sketched below.
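The region boundaries translate directly into a classifier like the one below (thresholds from the text; the function name is our own):

#include <cstddef>

// Classify the forward queue q_f into regions w_1/w_2/w_3 by its length.
int regionOf(std::size_t qfLen) {
    if (qfLen < 5)  return 1;  // w_1: queue nearly empty, consumer may starve
    if (qfLen < 15) return 2;  // w_2: nominal operating range
    return 3;                  // w_3: queue nearly full
}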
Task Scheduling

The task scheduling algorithm is used to maintain synergy between the producer-consumer (sub-task) pipeline in superstep 3. The algorithm initializes a thread pool of $c_{p_i} + 2$ threads, where $c_{p_i}$ is the number of available cores. In the first iteration, 2 threads each are assigned to the sub-tasks R and K, while the remaining $c_{p_i} - 2$ are assigned to I. Then, in each iteration, the $q_f$ region ($w_l$) and the $q_f.pop()$ wait time for I, given by $t_{wait}$, are monitored. A time series is built to forecast the next $t_{wait}$ (i.e. $t_{fct}$) using double exponential smoothing [38]. The $t_{wait}$ is also accumulated into $t_{cum}$. Two thresholds are defined: the minimum wait ($t_{min}$) and the maximum cumulative wait ($t_{max}$). Using all this information, a thread is removed from sub-task I and added to R if the following condition is satisfied:

$$c_{I \to R} = (t_{wait} \ge t_{min} \wedge (t_{cum} + t_{fct}) > t_{max}) \vee (w_l = w_1 \wedge |r| = 0)$$

The $t_{cum}$ is reset to 0 every time a thread is added to R. Similarly, a thread is removed from sub-task R and added to I if the following condition is satisfied; all threads are removed from R if the queue $q_f$ becomes full or there is no more experimental MS/MS data left to be loaded:

$$c_{R \to I} = (w_l = w_3 \wedge |r| > 0) \vee q_f.full()$$

The sub-task K uses its 2 over-subscribed threads to perform the overlapped I/O operations concurrently (Fig. 1c).
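The forecast-and-transfer logic can be sketched as follows; the smoothing factor, thresholds, and the driving wait-time values are illustrative assumptions rather than HiCOPS' tuned parameters:

#include <initializer_list>
#include <iostream>

// Holt's double exponential smoothing [38]: returns the one-step forecast t_fct.
struct Forecaster {
    double level = 0.0, trend = 0.0, alpha = 0.5;
    bool init = false;
    double update(double t_wait) {
        if (!init) { level = t_wait; init = true; return level; }
        double prev = level;
        level = alpha * t_wait + (1.0 - alpha) * (level + trend);
        trend = alpha * (level - prev) + (1.0 - alpha) * trend;
        return level + trend;
    }
};

// c_{I->R} = (t_wait >= t_min AND t_cum + t_fct > t_max) OR (w_l == w_1 AND |r| == 0)
bool moveThreadItoR(double t_wait, double t_fct, double t_cum,
                    double t_min, double t_max, int w_l, int producers) {
    return (t_wait >= t_min && (t_cum + t_fct) > t_max) || (w_l == 1 && producers == 0);
}

int main() {
    Forecaster f;
    double t_cum = 0.0;
    for (double w : {0.1, 0.2, 0.6, 1.4}) {   // a surge in I/O wait times
        double t_fct = f.update(w);
        t_cum += w;
        std::cout << moveThreadItoR(w, t_fct, t_cum, 0.5, 2.0, 2, 1) << "\n";
    }
    return 0;
}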
Load Balancing

The algorithmic workload in equation 20 is given by: $k_{10}(q\beta\sigma_{p_i}, |i|) + k_{11}(q\mu_{p_i}, |i|)$. Here, the terms $q\beta$ and $q$ are constants (experimental data size), whereas the terms $\sigma_{p_i}$ and $\mu_{p_i}$ are variable. The variable terms represent the filtered database size for a parallel HiCOPS process ($p_i$) and thus must be balanced across processes. We do this statically by constructing balanced database partitions (hence a balanced workload) using the LBE algorithm supplemented with our new Mod Distance metric in superstep 1 (Online Methods, Fig. 1a, Supplementary Fig. 3). The correctness of the LBE algorithm for load balancing is proven in Supplementary Text 4. In the future, we plan to devise and develop dynamic load balancing techniques in addition to this static technique for better results.
Sampling
The intermediate result produced by a parallel process ($p_i$) for an experimental spectrum ($q$) consists of the $M$ top-scoring database hits (8 bytes) and the frequency distribution of the scores (local null distribution) (2048 bytes). Since this frequency distribution follows a log-Weibull distribution, most of the data are localized near the mean. Using this information, we locate the mean and sample the $s$ (default: 120) most intense samples from the distribution, removing samples, if necessary, from the tail first. This allows us to fit all the intermediate results in a buffer of 256 bytes, limiting the size of each batch to 1.5MB. Thus, the intermediate results are almost instantly written to the file system by the sub-task K, resulting in a minimal data write cost $t_x(k)$. Fig. 5 illustrates an example of the sampling method.
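A sketch of the sampling step: it locates the mean bin of the score histogram and keeps a window of $s = 120$ bins around it. The symmetric clamped window and 16-bit bins are our simplifying assumptions; HiCOPS preferentially prunes the recoverable tail.

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

// Keep the s histogram bins around the distribution mean and drop the rest;
// the pruned log-Weibull tail can be recovered later by re-fitting.
std::vector<uint16_t> sampleAroundMean(const std::vector<uint16_t>& hist, int s = 120) {
    int64_t total = 0, weighted = 0;
    for (std::size_t i = 0; i < hist.size(); ++i) {
        total += hist[i];
        weighted += static_cast<int64_t>(i) * hist[i];
    }
    int mean = total ? static_cast<int>(weighted / total) : 0;  // mean bin index
    int lo = std::max(0, mean - s / 2);                          // clamp the window
    int hi = std::min<int>(static_cast<int>(hist.size()), lo + s);
    lo = std::max(0, hi - s);
    return std::vector<uint16_t>(hist.begin() + lo, hist.begin() + hi);
}

int main() {
    std::vector<uint16_t> hist(1024, 0);
    for (int i = 0; i < 1024; ++i)                               // synthetic peaked null
        hist[i] = static_cast<uint16_t>(std::max(0, 300 - std::abs(i - 400)));
    std::cout << sampleAroundMean(hist).size() << " bins kept\n"; // 120
    return 0;
}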
Code Availability

The HiCOPS core parallel model and algorithms have been implemented using object-oriented C++14 and MPI. The rich instrumentation feature has been implemented via Timemory [39] for performance analysis and optimizations. Command-line tools for MPI task mapping (Supplementary Text 5, Supplementary Algorithm 5), user parameter parsing, peptide sequence database processing, file format conversion, and result post-processing are also distributed with the HiCOPS framework. The build is managed via CMake 3.11+ [40]. Please refer to the software web page, hicops.github.io, for the source code and documentation.
Data Availability
The datasets and databases used in this study are publicly available from the mentioned respective data repositories. The experiment configuration files and raw results pertinent to the findings of this study are available from the corresponding author upon request.

Figure 5: The $s = 120$ datapoints around the mean are kept and the others are discarded. The discarding method prunes the distribution tail samples first, as they can be recovered by fitting a log-Weibull distribution to the sampled data.

Acknowledgments
This work used the NSF Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers through allocations TG-CCR150017 and TG-ASC200004. This research was supported by the NIGMS of the National Institutes of Health (NIH) under award number R01GM134384. The authors were further supported by the National Science Foundation (NSF) under award number NSF CAREER OAC-1925960. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health and/or the National Science Foundation.
References

[1] Alexey I Nesvizhskii. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. Journal of Proteomics, 73(11):2092–2123, 2010.
[2] Andy T Kong, Felipe V Leprevost, Dmitry M Avtonomov, Dattatreya Mellacheruvu, and Alexey I Nesvizhskii. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nature Methods, 14(5):513, 2017.
[3] Sean McIlwain, Kaipo Tamura, Attila Kertesz-Farkas, Charles E Grant, Benjamin Diament, Barbara Frewen, J Jeffry Howbert, Michael R Hoopmann, Lukas Kall, Jimmy K Eng, et al. Crux: rapid open source protein tandem mass spectrometry analysis. Journal of Proteome Research, 13(10):4488–4491, 2014.
[4] Zuo-Fei Yuan, Chao Liu, Hai-Peng Wang, Rui-Xiang Sun, Yan Fu, Jing-Fen Zhang, Le-Heng Wang, Hao Chi, You Li, Li-Yun Xiu, et al. pParse: a method for accurate determination of monoisotopic peaks in high-resolution mass spectra. Proteomics, 12(2):226–235, 2012.
[5] Yamei Deng, Zhe Ren, Qingfei Pan, Da Qi, Bo Wen, Yan Ren, Huanming Yang, Lin Wu, Fei Chen, and Siqi Liu. pClean: an algorithm to preprocess high-resolution tandem mass spectra for database searching. Journal of Proteome Research, 18(9):3235–3244, 2019.
[6] Sven Degroeve and Lennart Martens. MS2PIP: a tool for MS/MS peak intensity prediction. Bioinformatics, 29(24):3199–3203, 2013.
[7] Xie-Xuan Zhou, Wen-Feng Zeng, Hao Chi, Chunjie Luo, Chao Liu, Jianfeng Zhan, Si-Min He, and Zhifei Zhang. pDeep: predicting MS/MS spectra of peptides with deep learning. Analytical Chemistry, 89(23):12690–12697, 2017.
[8] Jing Zhang, Lei Xin, Baozhen Shan, Weiwu Chen, Mingjie Xie, Denis Yuen, Weiming Zhang, Zefeng Zhang, Gilles A Lajoie, and Bin Ma. PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Molecular & Cellular Proteomics, 11(4):M111–010587, 2012.
[9] Arun Devabhaktuni, Sarah Lin, Lichao Zhang, Kavya Swaminathan, Carlos G Gonzalez, Niclas Olsson, Samuel M Pearlman, Keith Rawson, and Joshua E Elias. TagGraph reveals vast protein modification landscapes from large tandem mass spectrometry datasets. Nature Biotechnology, page 1, 2019.
[10] Hao Chi, Chao Liu, Hao Yang, Wen-Feng Zeng, Long Wu, Wen-Jing Zhou, Xiu-Nan Niu, Yue-He Ding, Yao Zhang, Rui-Min Wang, et al. Open-pFind enables precise, comprehensive and rapid peptide identification in shotgun proteomics. bioRxiv, page 285395, 2018.
[11] Marshall Bern, Yuhan Cai, and David Goldberg. Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Analytical Chemistry, 79(4):1393–1400, 2007.
[12] Jimmy K Eng, Ashley L McCormack, and John R Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11):976–989, 1994.
[13] Robertson Craig and Ronald C Beavis. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Communications in Mass Spectrometry, 17(20):2310–2316, 2003.
[14] Benjamin J Diament and William Stafford Noble. Faster SEQUEST searching for peptide identification from tandem mass spectra. Journal of Proteome Research, 10(9):3871–3879, 2011.
[15] Jimmy K Eng, Bernd Fischer, Jonas Grossmann, and Michael J MacCoss. A fast SEQUEST cross correlation algorithm. Journal of Proteome Research, 7(10):4598–4602, 2008.
[16] Christopher Y Park, Aaron A Klammer, Lukas Kall, Michael J MacCoss, and William S Noble. Rapid and accurate peptide identification from tandem mass spectra. Journal of Proteome Research, 7(7):3022–3027, 2008.
[17] Lewis Y Geer, Sanford P Markey, Jeffrey A Kowalak, Lukas Wagner, Ming Xu, Dawn M Maynard, Xiaoyu Yang, Wenyao Shi, and Stephen H Bryant. Open mass spectrometry search algorithm. Journal of Proteome Research, 3(5):958–964, 2004.
[18] Alexander S Hebert, Alicia L Richards, Derek J Bailey, Arne Ulbrich, Emma E Coughlin, Michael S Westphall, and Joshua J Coon. The one hour yeast proteome. Molecular & Cellular Proteomics, 13(1):339–347, 2014.
[19] Alexey I Nesvizhskii, Franz F Roos, Jonas Grossmann, Mathijs Vogelzang, James S Eddes, Wilhelm Gruissem, Sacha Baginsky, and Ruedi Aebersold. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Molecular & Cellular Proteomics, 5(4):652–670, 2006.
[20] Jimmy K Eng, Brian C Searle, Karl R Clauser, and David L Tabb. A face in the crowd: recognizing peptides through database search. Molecular & Cellular Proteomics, pages mcp–R111, 2011.
[21] Muhammad Haseeb and Fahad Saeed. Efficient shared peak counting in database peptide search using compact data structure for fragment-ion index. Pages 275–278, IEEE, 2019.
[22] Fahad Saeed. Communication lower-bounds for distributed-memory computations for mass spectrometry based omics data. arXiv preprint arXiv:2009.14123, 2020.
[23] Vivien Marx. Biology: the big challenges of big data, 2013.
[24] Dexter T Duncan, Robertson Craig, and Andrew J Link. Parallel tandem: a program for parallel processing of tandem mass spectra using PVM or MPI and X!Tandem. Journal of Proteome Research, 4(5):1842–1847, 2005.
[25] Robert D Bjornson, Nicholas J Carriero, Christopher Colangelo, Mark Shifman, Kei-Hoi Cheung, Perry L Miller, and Kenneth Williams. X!!Tandem, an improved method for running X!Tandem in parallel on collections of commodity computers. The Journal of Proteome Research, 7(1):293–299, 2007.
[26] Brian Pratt, J Jeffry Howbert, Natalie I Tasman, and Erik J Nilsson. MR-Tandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services. Bioinformatics, 28(1):136–137, 2011.
[27] Chuang Li, Kenli Li, Keqin Li, and Feng Lin. MCtandem: an efficient tool for large-scale peptide identification on many integrated core (MIC) architecture. BMC Bioinformatics, 20(1):397, 2019.
[28] C Li, K Li, T Chen, Y Zhu, and Q He. SW-Tandem: a highly efficient tool for large-scale peptide sequencing with parallel spectrum dot product on Sunway TaihuLight. Bioinformatics (Oxford, England), 2019.
[29] Amol Prakash, Shadab Ahmad, Swetaketu Majumder, Conor Jenkins, and Ben Orsburn. Bolt: a new age peptide search engine for comprehensive MS/MS sequencing through vast protein databases in minutes. Journal of The American Society for Mass Spectrometry, 30(11):2408–2418, 2019.
[30] Leslie G Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[31] John Towns, Timothy Cockerill, Maytal Dahan, Ian Foster, Kelly Gaither, Andrew Grimshaw, Victor Hazlewood, Scott Lathrop, Dave Lifka, Gregory D Peterson, et al. XSEDE: accelerating scientific discovery. Computing in Science & Engineering, 16(5):62–74, 2014.
[32] Robertson Craig and Ronald C Beavis. TANDEM: matching proteins with tandem mass spectra. Bioinformatics, 20(9):1466–1467, 2004.
[33] Muhammad Haseeb, Fatima Afzali, and Fahad Saeed. LBE: a computational load balancing algorithm for speeding up parallel peptide search in mass-spectrometry based proteomics. Pages 191–198, IEEE, 2019.
[34] Hao Chi, Kun He, Bing Yang, Zhen Chen, Rui-Xiang Sun, Sheng-Bo Fan, Kun Zhang, Chao Liu, Zuo-Fei Yuan, Quan-Hui Wang, et al. pFind-Alioth: a novel unrestricted database search algorithm to improve the interpretation of high-resolution MS/MS data. Journal of Proteomics, 125:89–97, 2015.
[35] Jiarui Ding, Jinhong Shi, Guy G Poirier, and Fang-Xiang Wu. A novel approach to denoising ion trap tandem mass spectra. Proteome Science, 7(1):9, 2009.
[36] Kaiyuan Liu, Sujun Li, Lei Wang, Yuzhen Ye, and Haixu Tang. Full-spectrum prediction of peptides tandem mass spectra using deep neural network. Analytical Chemistry, 92(6):4275–4283, 2020.
[37] David Fenyö and Ronald C Beavis. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Analytical Chemistry, 75(4):768–774, 2003.
[38] Joseph J LaViola. Double exponential smoothing: an alternative to Kalman filter-based predictive tracking. In Proceedings of the Workshop on Virtual Environments 2003, pages 199–206, 2003.
[39] Jonathan R Madsen, Muaaz G Awan, Hugo Brunie, Jack Deslippe, Rahul Gayatri, Leonid Oliker, Yunsong Wang, Charlene Yang, and Samuel Williams. Timemory: modular performance analysis for HPC. In International Conference on High Performance Computing, pages 434–452. Springer, 2020.
[40] Bill Hoffman, David Cole, and John Vines. Software process for rapid development of HPC software using CMake. Pages 378–382, IEEE, 2009.
[41] Gaurav Kulkarni, Ananth Kalyanaraman, William R Cannon, and Douglas Baxter. A scalable parallel approach for peptide identification from large-scale mass spectrometry data. Pages 423–430, IEEE, 2009.

Supplementary Figures

Supplementary Figure 1
The proteins are proteolyzed into peptides using an enzyme, typically Trypsin. The resultant peptide mixture is fed to an automated liquid chromatography (LC) coupled two-stage MS/MS pipeline (LC-MS/MS), which yields the experimental MS/MS data.
Supplementary Figure 2
The acquired experimental MS/MS data are compared against a database of model-spectra. The model-spectra are simulated in-silico using a protein sequence database. Post-translational modifications (PTMs) are added in the simulation process to expand the search space.
Supplementary Figure 3
The improved LBE method used in superstep 1 clusters the model-spectra database entries (shown as shapes) using two distance metrics: the Edit Distance ($\Delta_e$) and the Mod Distance ($\Delta_m$) (Supplementary Text 4). The obtained database clusters are then finely and evenly scattered across the database partitions at the parallel HiCOPS processes in either round-robin or random fashion.

Supplementary Figure 4

The following sub-figures show the decomposition of the runtime, speedup and strong-scale efficiency results obtained for all 12 experiment sets ($e_i$) into individual supersteps ($s_j$) and overheads ($V$). The sub-figures depict that the overall efficiency increases as the workload (database, dataset and search filter) size increases. It can also be seen that the overall speedup (and efficiency) closely follows superstep 3 ($s_3$), confirming its largest contribution towards the overall performance. This observation further indicates that the overheads associated with this superstep must be correctly identified and optimized for the best performance. The super-linear speedups observed in the case of large experimental workloads result from the improved CPU utilization due to the reduced memory intensity per parallel node (see Fig. 3c).

Supplementary Text and Algorithms

Supplementary Text 1
Related Work.
The distributed memory parallel database peptide search algorithms emerged with Parallel Tandem [24], a variant of the X!Tandem [32] tool. Parallel Tandem achieves parallelism by spawning multiple instances of the original X!Tandem using MPI or PVM, where each instance processes a chunk of the experimental dataset files. X!!Tandem [25] is another variant of X!Tandem, which implements an internal (but similar) parallel technique for the computational and synchronization steps. The experimental dataset files in the case of X!!Tandem are shuffled among the MPI processes to achieve better load balance. MR-Tandem [26] follows a strategy similar to X!!Tandem, however, by breaking the computations into small Map and Reduce tasks (MapReduce model), exhibiting better parallel efficiency than Parallel Tandem and X!!Tandem. MCtandem [27] and SW-Tandem [28] implement the same parallel design but offload X!Tandem's expensive Spectral Dot Product (SDP) computations to the Intel Many Integrated Core (MIC) co-processor and Haswell AVX2 vector instructions, respectively. Both algorithms also implement optimization techniques including double buffering, pre-fetching, overlapped communication and computations, and a task-distributor for better performance. Bolt [29] implements a similar parallel design for an MSFragger-like [2] algorithm, where each parallel instance constructs a full model-spectra database and processes a chunk of the experimental data.
Supplementary Text 2
Limitations in Related Work.
The major limitation of all existing distributed-memory database peptide search algorithms is the inflated space complexity O(PN), where P is the number of parallel nodes and O(N) is the space complexity of their shared-memory counterparts. The space complexity inflation stems from the replication of massive model-spectra databases at all parallel instances. Consequently, the application of existing algorithms is limited to use cases where the indexed model-spectra database fits within the main memory on all system nodes; otherwise, expensive memory swaps, page faults, load imbalance, and out-of-core processing overheads lead to an extremely inefficient solution. Furthermore, as PTMs are added, this memory upper-bound is quickly exhausted due to the combinatorial increase in the database size [10], [2], incurring further slowdowns. For reference, the model-spectra database constructed from a standard Homo sapiens proteome sequence database can grow from 3.8 million to 500 million model-spectra (0.6TB) if only the six most common PTMs (i.e. oxidation, phosphorylation, deamidation, acetylation, methylation, and hydroxylation) are incorporated. There have been some efforts towards parallel strategies that split the model-spectra database among parallel processing units [41]. In these designs, the database search is implemented in a streaming fashion where each parallel process receives a batch of experimental data, executes a partial search, and passes the results on to the next process in the stream. However, these models suffer from significant amounts of on-the-fly computation and frequent data communication between parallel nodes, leading to high compute times and limited speedups.
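To make the combinatorial increase concrete, the short C++ sketch below counts the modified forms of a single peptide under a simplified model in which each modifiable residue carries at most one PTM and at most maxMods residues are modified simultaneously (the function names, and the cap of three concurrent modifications in the example, are our illustrative assumptions, not HiCOPS parameters):

    #include <cstdint>
    #include <iostream>

    // Exact binomial coefficient C(n, k) via the incremental product;
    // each intermediate product is divisible by i, so the integer
    // division is exact.
    uint64_t binomial(uint64_t n, uint64_t k) {
        if (k > n) return 0;
        uint64_t r = 1;
        for (uint64_t i = 1; i <= k; i++)
            r = r * (n - i + 1) / i;
        return r;
    }

    // Count modified forms of one peptide: choose i of n modifiable
    // residues for i = 0..maxMods (a sum of binomials).
    uint64_t variants(uint64_t nModifiableSites, uint64_t maxMods) {
        uint64_t total = 0;
        for (uint64_t i = 0; i <= maxMods; i++)
            total += binomial(nModifiableSites, i);
        return total;
    }

    int main() {
        // e.g., a peptide with 10 modifiable residues and <= 3 concurrent
        // mods already yields 1 + 10 + 45 + 120 = 176 database entries
        std::cout << variants(10, 3) << "\n";
        return 0;
    }

Multiplying such per-peptide factors over the millions of peptides in a proteome database is what drives the growth from millions to hundreds of millions of model-spectra.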
Supplementary Text 3

Mod Distance.
The proposed Mod Distance (∆m) is used as a supplementary metric for peptide database clustering in superstep 1 (the improved LBE method). The application of this metric is best understood through an example. Consider three database peptide sequences p: MEGSYIRK, q: ME*GSYI*RK, and r: MEGS*Y*IRK, where the letters marked with (*) represent the modified amino acids and the unmarked letters represent the normal amino acids. The Edit Distance between the pairs is ∆e(p, q) = ∆e(p, r) = 2, and therefore cannot differentiate them. The Mod Distance further separates such pairs by considering the shared peaks between them. For example, the number of shared (b- and y-) ions (or peaks) between p and q is 3, yielding a Mod Distance ∆m(p, q) greater than that of p and r: the number of shared ions between p and r is 6, yielding ∆m(p, r) = 1.25. This indicates that the entries p and r should be located at relatively nearby database indices. The Mod Distance can be easily generalized to other ion series, such as a-, c-, x-, z-ions and immonium ions, as well.
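A minimal C++ sketch of the shared b-/y-ion count used above, under the simplifying assumption that a fragment is shared only if none of its residues carries a modification (a modification shifts the fragment mass); the helper names are ours, not the HiCOPS API. The sketch reproduces the counts from the example (3 for p and q, 6 for p and r):

    #include <string>
    #include <vector>
    #include <iostream>

    // Parse a peptide like "ME*GSYI*RK": '*' marks the preceding residue
    // as modified. Assumes well-formed input (no leading '*').
    static void parse(const std::string& pep, std::string& seq,
                      std::vector<bool>& mod) {
        for (char c : pep) {
            if (c == '*') mod.back() = true;
            else { seq.push_back(c); mod.push_back(false); }
        }
    }

    // Count b-/y-ions of the unmodified peptide that are preserved in
    // the modified one: a fragment is shared only if none of its
    // residues carries a mod.
    static int countSharedBY(const std::string& pep) {
        std::string seq; std::vector<bool> mod;
        parse(pep, seq, mod);
        int n = (int)seq.size(), shared = 0;
        // b-ions: prefixes b1..b(n-1) with no modified residue
        for (int i = 0; i < n - 1 && !mod[i]; i++) shared++;
        // y-ions: suffixes y1..y(n-1) with no modified residue
        for (int i = n - 1; i > 0 && !mod[i]; i--) shared++;
        return shared;
    }

    int main() {
        std::cout << countSharedBY("ME*GSYI*RK") << "\n"; // 3
        std::cout << countSharedBY("MEGS*Y*IRK") << "\n"; // 6
        return 0;
    }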
Supplementary Text 4
Correctness of LBE.
Let the peptide precursor m/z distribution of any given database be g(m) and that of any given dataset be f(m); then the LBE algorithm statically results in fairly balanced workloads at all parallel nodes.

Proof. The algorithmic workload w(f, g) for database peptide search can be given as the cost of performing the total number of comparisons to search the dataset f(m) against the database g(m) using filter size δM and shared peaks ≥ k. Mathematically:

w(f, g) = cost( Σ_{m=0}^{∞} f(m) Σ_{z=−δM}^{+δM} shp(f(m), g(m+z), k) )

where:

shp(f, g, k) = count( shared_peaks(f, g) ≥ k )

The above equations imply that the database distribution, i.e. Σ_z shp(f(m), g(m+z), k), must be similar at all parallel nodes in order to achieve system-wide load balance. The LBE algorithm achieves this by localizing (by δM and shared peaks) the database entries and then finely scattering them across the parallel nodes (Supplementary Fig. 3), producing identical local database distributions g_loc(m) at the parallel nodes and, thereby, identical workloads. This theorem can also be extended to incorporate sequence-tag based filtration methods in a straightforward manner.
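A minimal sketch of the cluster-then-scatter step that yields these identical local distributions, assuming the clusters have already been formed using ∆e and ∆m; the types and the round-robin policy mirror Supplementary Fig. 3, while all names are illustrative:

    #include <vector>
    #include <cstddef>

    // One clustered group of model-spectra database entries (entries
    // that are close under Edit Distance and Mod Distance).
    struct Cluster { std::vector<int> entryIds; };

    // Scatter each cluster's entries round-robin across P partitions so
    // every node receives a fine, even slice of every local
    // neighborhood, yielding near-identical local distributions g_loc(m).
    std::vector<std::vector<int>>
    scatterRoundRobin(const std::vector<Cluster>& clusters, std::size_t P) {
        std::vector<std::vector<int>> partitions(P);
        std::size_t next = 0; // next partition to receive an entry
        for (const Cluster& c : clusters)
            for (int id : c.entryIds)
                partitions[next++ % P].push_back(id);
        return partitions;
    }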
Supplementary Text 5

Task Mapping.
The parallel HiCOPS tasks are configured and deployed on the system nodes based on the available resources, user parameters, and the database size. The presented algorithm assumes a Linux-based homogeneous cluster of multicore nodes where the interconnected nodes have multiple cores, local shared memory, and optionally local storage as well. This is the most common architecture in modern supercomputers, including XSEDE Comet, NERSC Cori, etc. The resource information is read using Linux's lscpu utility. Specifically, the information about shared memory per node (λ), NUMA nodes per node (u), cores per NUMA node (c_u), number of sockets per node (s), and cores per socket (c_s) is read. The total size of the database (D) is then estimated using the protein sequence database and the user parameters. Assuming the total number of system nodes to be P, the parameters number of MPI tasks per node (t_n), number of parallel cores per MPI task (t_c), and MPI task binding level (t_bl) are optimized as depicted in Supplementary Algorithm 5. The optimizations ensure that: 1) the system resources are efficiently utilized, 2) the MPI tasks have sufficient resources to process the database, and 3) the MPI tasks have exclusive access to a disjoint partition of the local compute and memory resources.

Note that in Supplementary Algorithm 5, the while loop (lines 8 to 14) iteratively reduces the cores per MPI task while increasing the number of MPI tasks until the database size per MPI task is less than 48 million (empirically set for XSEDE Comet nodes). This was done to reduce the memory contention per MPI process for superior performance. The while loop may be removed or modified depending on the database search algorithms and machine parameters.
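A compact C++ sketch of that refinement loop, under our reading of Supplementary Algorithm 5; the factor-search helper and the exact splitting policy are assumptions, and the 48-million threshold is the empirical Comet value quoted above:

    #include <cstdint>

    // Largest proper factor of n; returns n itself if n is prime.
    static uint64_t largestProperFactor(uint64_t n) {
        for (uint64_t f = 2; f * f <= n; f++)
            if (n % f == 0) return n / f;
        return n;
    }

    // Refine (tasksPerNode, coresPerTask) until the database share per
    // MPI task drops below the empirical threshold: split each task
    // into smaller ones along factors of its core count.
    static void refineMapping(uint64_t dbPerNode, uint64_t& tasksPerNode,
                              uint64_t& coresPerTask) {
        const uint64_t kMaxEntriesPerTask = 48000000ULL;
        while (dbPerNode / tasksPerNode > kMaxEntriesPerTask) {
            uint64_t f = largestProperFactor(coresPerTask);
            if (f == coresPerTask || f < 2) break; // cannot split further
            tasksPerNode *= coresPerTask / f;
            coresPerTask = f;
        }
    }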
Supplementary Text 6

SW-Tandem.
The SW-Tandem binaries were obtained from its GitHub repository: https://github.com/Logic09/SW-Tandem and were run on the XSEDE Comet system with an increasing number of parallel nodes using MPI, but no speedups in runtime were observed. We repeatedly tried to contact the corresponding author about this issue via email and GitHub issues (https://github.com/Logic09/SW-Tandem/issues) but did not receive a response as of the submission date of this paper.

Supplementary Algorithm 1
Algorithm 1: Partial Database Construction in Superstep 1
Data: peptide sequences (ε)
Result: indexed partial database (D_i)

/* generate database entries */
for s in ε do in parallel
    for v in m do
        e ← gen_entry(v)
        /* add to the partial database if mine */
        if is_mine(e_v) then E.append(e)
/* generate model-spectra */
for s in D_i do in parallel
    S.append(model_spectrum(s))
/* index the database in parallel */
D_i ← map(parallel_sort(E), parallel_index(S))
/* return the indexed partial database */
return D_i
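A condensed C++/OpenMP sketch of the generate-and-keep-if-mine pattern of Algorithm 1; the modulo ownership test is an illustrative stand-in for gen_entry and is_mine, whose actual definitions follow the LBE scatter (Supplementary Fig. 3):

    #include <vector>
    #include <omp.h>

    struct Entry { long id; /* peptide variant metadata */ };

    // Generate all variants, but append only the entries this process
    // owns (cf. is_mine in Algorithm 1). Thread-local buffers avoid
    // contention; they are merged once at the end.
    std::vector<Entry> buildPartial(const std::vector<long>& variantIds,
                                    int rank, int nprocs) {
        std::vector<Entry> mine;
        #pragma omp parallel
        {
            std::vector<Entry> local;
            #pragma omp for nowait
            for (long i = 0; i < (long)variantIds.size(); i++) {
                Entry e{ variantIds[i] };  // stand-in for gen_entry(v)
                if (e.id % nprocs == rank) // stand-in for is_mine(e)
                    local.push_back(e);
            }
            #pragma omp critical
            mine.insert(mine.end(), local.begin(), local.end());
        }
        return mine;
    }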
Supplementary Algorithm 2

Algorithm 2: Data load (per thread) by sub-task R (Superstep 3)

Data: forward queue (q_f), recycle queue (q_r), pointer stack (s_d), batch index (i_d)

/* loop unless q_f full, preempted, or no more batches */
while ∼q_f.full() do
    /* check the pointer stack */
    if ∼dp then dp ← s_d.pop()
    /* if the stack is empty, get a new pointer */
    if ∼dp then dp ← i_d.pop()
    /* no more experimental data batches - exit */
    if ∼dp then break
    /* check the preemption state and q_r status */
    if preempt() or q_r.empty() then
        s_d.push(dp); break
    else
        /* get a buffer from q_r */
        bp ← q_r.pop()
        /* read a batch of experimental data */
        dp.read_batch(bp)
        /* push the buffer to q_f */
        q_f.push(bp)
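A hedged C++ sketch of this reader pattern, built on a minimal thread-safe queue; the class and method names are ours, and the real HiCOPS queues add the preemption and pointer-stack logic of Algorithm 2:

    #include <queue>
    #include <mutex>
    #include <utility>

    // Minimal thread-safe queue used by the reader sketch below.
    template <typename T>
    class SafeQueue {
        std::queue<T> q_;
        std::mutex m_;
    public:
        void push(T v) {
            std::lock_guard<std::mutex> l(m_);
            q_.push(std::move(v));
        }
        bool pop(T& out) {
            std::lock_guard<std::mutex> l(m_);
            if (q_.empty()) return false;
            out = std::move(q_.front()); q_.pop(); return true;
        }
    };

    struct Buffer { /* raw spectra bytes */ };
    struct BatchPtr { void readBatch(Buffer&) { /* file I/O elided */ } };

    // Reader loop (cf. Algorithm 2): take the next batch pointer, fill
    // a recycled buffer with its data, and hand the full buffer to the
    // search threads via the forward queue, overlapping I/O with search.
    void readerLoop(SafeQueue<BatchPtr>& batchIndex,
                    SafeQueue<Buffer>& recycleQ,
                    SafeQueue<Buffer>& forwardQ) {
        BatchPtr dp;
        while (batchIndex.pop(dp)) {
            Buffer bp;
            if (!recycleQ.pop(bp)) break; // no free buffer: stop/yield
            dp.readBatch(bp);             // file I/O into the buffer
            forwardQ.push(std::move(bp)); // hand off to consumers
        }
    }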
Supplementary Algorithm 3

Algorithm 3: Partial DB search by sub-task R (Superstep 3)

Data: forward queue (q_f), recycle queue (q_r), partial database (D_i^p), result queue (q_k)

/* extract a batch from the queue */
b ← q_f.pop()
/* data parallel search */
for e in b do in parallel
    /* apply the precursor mass filter */
    σ_i^p ← filter(D_i^p, e)
    if σ_i^p then
        for β in e do
            /* apply the shared peaks filter */
            μ_i^p.append(filter(σ_i^p, β))
        /* score against the filtered database */
        for h in μ_i^p do
            heap.push(k ← score(h, e))
        /* append to a batch of intermediate results */
        res_i.append(heap)
/* recycle the buffer back to q_r */
q_r.push(b)
/* push the intermediate results batch to q_k */
q_k.push(res_i)
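A simplified C++ sketch of the two filtering stages of Algorithm 3, assuming the partial database index is sorted by precursor mass (so the δM window is a binary search) and that the peak lists are sorted (so shared peaks reduce to a linear merge); all types and tolerances are illustrative:

    #include <vector>
    #include <algorithm>
    #include <utility>

    struct ModelSpectrum { double precursorMass; std::vector<float> peaks; };

    // Stage 1: precursor mass filter. With the partial database sorted
    // by precursor mass, the candidates within +/- deltaM are found by
    // two binary searches; returns the [lo, hi) index window.
    std::pair<size_t, size_t>
    precursorWindow(const std::vector<ModelSpectrum>& db,
                    double mass, double deltaM) {
        auto lo = std::lower_bound(db.begin(), db.end(), mass - deltaM,
            [](const ModelSpectrum& s, double m) { return s.precursorMass < m; });
        auto hi = std::upper_bound(db.begin(), db.end(), mass + deltaM,
            [](double m, const ModelSpectrum& s) { return m < s.precursorMass; });
        return { size_t(lo - db.begin()), size_t(hi - db.begin()) };
    }

    // Stage 2: shared-peak filter. Count peaks matched within a
    // fragment tolerance; only candidates with >= k shared peaks are
    // passed on to the (expensive) scoring step.
    int sharedPeaks(const std::vector<float>& a, const std::vector<float>& b,
                    float tol) {
        int n = 0; size_t i = 0, j = 0; // both peak lists sorted
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j] - tol) i++;
            else if (b[j] < a[i] - tol) j++;
            else { n++; i++; j++; }
        }
        return n;
    }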
Supplementary Algorithm 4

Algorithm 4: Result Assembly in Superstep 4
Data: rank p_i, intermediate result batches (r_i)
Result: expect scores (ev)

/* extract a batch from the queue */
b ← q_f.pop()
/* get the batches that satisfy the condition */
for b in (b mod p_i = 0) do
    l.append(b)
/* data parallel assembling of results for each batch */
for s in l do in parallel
    /* assemble the null distribution */
    dist ← assemble(s)
    /* max-heapify the scores */
    heap ← make_heap(s)
    /* use either fitting method */
    fit ← logWeibullFit(dist)
    fit ← TailFit(dist)
    /* get the top hit from the heap */
    g_max ← heap.pop().value()
    /* compute the expect score */
    ev ← (fit.w × g_max + fit.b) × heap.size()
    /* push the results to a map structure */
    map.push(key = g_max.key(), val = ev)
/* asynchronously scatter the complete result data */
for p_i in P do in parallel
    isend(map.data(key = p_i), dst = p_i)
/* synchronize using a barrier */
barrier()
/* write the results to the file system */
write(map.data(key = rank))
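The expect-score line of Algorithm 4, restated as code for clarity; the Fit struct simply carries the slope and intercept produced by either fitting method, and the formula follows the pseudocode verbatim:

    #include <cstddef>

    // Linear fit parameters (slope w, intercept b) obtained from either
    // logWeibullFit or TailFit over the assembled null distribution.
    struct Fit { double w; double b; };

    // Expect score as in Algorithm 4: ev = (w * gMax + b) * nCandidates,
    // where gMax is the top hit's score and nCandidates the number of
    // remaining scores in the heap.
    double expectScore(const Fit& fit, double gMax, std::size_t nCandidates) {
        return (fit.w * gMax + fit.b) * static_cast<double>(nCandidates);
    }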
Supplementary Algorithm 5

Algorithm 5: Task Mapping
Data: number of nodes (P), node parameters (λ, u, c_u, s, c_s), and database size (D)
Result: number of MPI tasks per node (t_n), cores per MPI task (t_c), and MPI binding level (t_bl)

/* ensure enough memory for the database */
if D_i^p ← D/P > .λ then return err
/* set the MPI binding level */
t_bl ← max{u, c}
/* set the MPI binding policy */
t_bp ← scatter
/* set the cores per MPI task */
t_c ← min{c_u, c_s}
/* set the number of MPI tasks per node */
t_n ← max{u, c}
t_max ← t_c
/* Optional: optimize for memory bandwidth */
while (D/t_n > 48 × 10^6) do
    /* choose the next highest factor of t_max */
    n_poss ← factorize(t_max)
    if n_poss ≥ t_max/ then
        t_n ← t_n × t_max/n_poss
        t_c ← n_poss
    else
        break
return t_n, t_c, t_bl, t_bp
Supplementary Tables

Supplementary Table 1

A snippet of the peptide-to-spectrum matches (PSMs) and e-values obtained by searching the dataset E against the database D (no mods, δM = 500Da). The full table can be requested from the corresponding author.

Matched Peptide | e-values across parallel node counts (1 to 64)
HLTYENVER   | 6.6e-5, 6.5e-5, 6.5e-5, 6.5e-5, 6.5e-5
SEGESSRSVR  | 3.175e-3, 3.174e-3, 3.174e-3, 3.175e-3, 3.174e-3
IFQCNKHMK   | 0.037038, 0.037037, 0.037037, 0.037036, 0.037037
FIVSKNK     | 0.113302, 0.113301, 0.113298, 0.113297, 0.113297
QQIVSGR     | 1.294027, 1.293975, 1.293975, 1.293975, 1.293975
STVASMMHR   | 2.641636, 2.64151, 2.64151, 2.64151, 2.64151
TLFKSSLK    | 7.000016, 7.0, 7.0, 7.0, 7.0
QKQLLKEQK   | 16.856401, invalid, 16.855967, invalid
Supplementary Table 2
Speed comparison between existing tools and HiCOPS for experiment 1a; dataset size: 8K, database size: 93.5M, precursor mass tolerance: δM = 10.0Da.

Execution time (s) for parallel nodes:

Search Tool | 1      | 2      | 4      | 8      | 16
HiCOPS      | -      | 166.32 | 126.35 | 113.53 | 134.86
X!!Tandem   | 4980   | 2445   | 1279.8 | 690    | 360
SW-Tandem   | 1015   | 992    | 1002   | 999    | 1019
MSFragger   | 299.4  | -      | -      | -      | -
X!Tandem    | 957    | -      | -      | -      | -
Crux/Tide   | 2470   | -      | -      | -      | -

Supplementary Table 3
Speed comparison between existing tools and HiCOPS for experiment 1b; dataset size: 8K, database size: 93.5M, precursor mass tolerance: δM = 500.0Da.

Execution time (s) for parallel nodes:

Search Tool | 1      | 2     | 4      | 8     | 16    | 32    | 64
HiCOPS      | -      | 188   | 135    | 115   | 101   | 101   | 144
X!!Tandem   | 115K   | 57.7K | 29.05K | 14.6K | 7.4K  | 3.72K | 1.98K
SW-Tandem   | 19.99K | 17.1K | 15.4K  | 14.3K | 15.1K | 15K   | 15K
MSFragger   | 521    | -     | -      | -     | -     | -     | -
X!Tandem    | 18.65K | -     | -      | -     | -     | -     | -
Crux/Tide   | segmentation fault
Supplementary Table 4
Speed comparison between HiCOPS and existing tools for experiment 2a; dataset size: 3.8M, database size: 93.5M, precursor mass tolerance: δM = 10.0Da. X!!Tandem and SW-Tandem ran for 2 days in all parallel configurations but failed to complete and were terminated by SLURM due to the maximum job time limit on the XSEDE Comet system.

Execution time (s) for parallel nodes:

Search Tool | 1        | 2       | 4       | 8      | 16
HiCOPS      | -        | 557.549 | 371.585 | 262.16 | 213.622
X!!Tandem   | terminated after 2 days
SW-Tandem   | terminated after 2 days
MSFragger   | 13402.66 | -       | -       | -      | -
X!Tandem    | 1.71M    | -       | -       | -      | -
Crux/Tide   | 875.5K   | -       | -       | -      | -

Note: Tide limits the maximum peptide precursor tolerance to δM = ±

Supplementary Table 5

Speed comparison between HiCOPS and existing tools for experiment 2b; dataset size: 3.8M, database size: 93.5M, precursor mass tolerance: δM = 500.0Da. X!!Tandem and SW-Tandem ran for 2 days in all parallel configurations but failed to complete and were terminated by SLURM due to the maximum job time limit on the XSEDE Comet system. X!Tandem had been running for 75 days at the time of submission of this manuscript and is expected to run over 8 months to complete its execution.
Execution time (s) for parallel nodes:

Search Tool | 1        | 2     | 4    | 8    | 16   | 32  | 64
HiCOPS      | -        | 23.5K | 6.6K | 2.8K | 1.4K | 807 | 485
X!!Tandem   | terminated after 2 days
SW-Tandem   | terminated after 2 days
MSFragger   | 170.1K   | -     | -    | -    | -    | -   | -
X!Tandem    | 75 days* | -     | -    | -    | -    | -   | -
Crux/Tide   | segmentation fault

* Tide limits the maximum peptide precursor tolerance to δM = ±

Supplementary Protocol

Minimum Environment

• Laptop, desktop, or SMP cluster (HPC) with Linux OS
• GCC 7.2+ compiler with C++14, OpenMP, and threading support
• MPI with multi-threading support
• Python 3.7+ and common packages
• CMake 3.11+
Install
Comprehensive details about the required packages, supported environments, and step-by-step installation of the packages and HiCOPS are documented at: hicops.github.io/installation. This link will be updated as the development progresses.
Getting Started
The instructions for setting up the peptide database and the experimental MS/MS dataset, and for running HiCOPS, are documented at: hicops.github.io/getting_started. If you are running HiCOPS on the SDSC XSEDE Comet cluster, you can follow a simpler set of instructions documented at: hicops.github.io/getting_started/xsede.
Integrating with HiCOPS framework
The details on integrating existing and new algorithms with the HiCOPS parallel core library are documented at: hicops.github.io/getting_started/integrate. Currently, the integration must be done via the provided functional interface (and data structures). In the near future, the integration will be redesigned using a C++ template meta-programming interface. The documentation will be updated accordingly.
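As a rough illustration of what a functional-interface integration can look like; the structs, the score_fn_t signature, and the commented registration call below are hypothetical, not the actual HiCOPS API (see hicops.github.io/getting_started/integrate for the real interface):

    #include <cstddef>

    // Hypothetical data structures: a candidate model-spectrum and an
    // experimental spectrum, reduced to what a scorer would need.
    struct hicops_candidate { const float* peaks; size_t n; };
    struct hicops_spectrum  { const float* peaks; size_t n; };

    // Hypothetical functional interface: a user-supplied scoring kernel
    // plugged into the parallel core, which would handle partitioning,
    // scheduling, and result assembly around it.
    typedef double (*score_fn_t)(const hicops_candidate*,
                                 const hicops_spectrum*);

    // Example kernel: a naive shared-peak count stub (illustrative only).
    static double myScorer(const hicops_candidate* c,
                           const hicops_spectrum* e) {
        size_t i = 0, j = 0; double shared = 0;
        while (i < c->n && j < e->n) {
            if (c->peaks[i] < e->peaks[j]) i++;
            else if (e->peaks[j] < c->peaks[i]) j++;
            else { shared++; i++; j++; }
        }
        return shared;
    }

    // Registration as it might appear (hypothetical call):
    // hicops_register_scorer(myScorer);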
Command-line tools
Several command-line tools are distributed as a part of the HiCOPS software. These tools provide support for the runtime interface, preparation of the database and dataset, and post-processing of the final results. A brief summary of each tool is documented at: hicops.github.io/tools. More tools will be provided in future releases.
Current version
The current released version of HiCOPS is v1.0: hicops.github.io