[PDF] Computational Performance of a Germline Variant Calling Pipeline for Next Generation Sequencing

Abstract

With the booming of next generation sequencing technology and its implementation in clinical practice and life science research, the need for faster and more efficient data analysis methods becomes pressing in the field of sequencing. Here we report on the evaluation of an optimized germline mutation calling pipeline, HummingBird, by assessing its performance against the widely accepted BWA-GATK pipeline. We found that the HummingBird pipeline can significantly reduce the running time of the primary data analysis for whole genome sequencing and whole exome sequencing while without significantly sacrificing the variant calling accuracy. Thus, we conclude that expansion of such software usage will help to improve the primary data analysis efficiency for next generation sequencing.

Full PDF

CComputational Performance of a Germline VariantCalling Pipeline for Next Generation Sequencing

Jie Liu , Xiaotian Wu , Kai Zhang , Bing Liu , Renyi Bao , Xiao Chen ,Yiran Cai , Yiming Shen , Xinjun He , Jun Yan , and Weixing Ji School of Computer Science & Technology, Beijing Institute of Technology, Beijing, China Nanjing YiGenCloud Institute, Nanjing, China

Abstract —With the booming of next generation sequencingtechnology and its implementation in clinical practice and lifescience research, the need for faster and more efﬁcient data anal-ysis methods becomes pressing in the ﬁeld of sequencing. Here wereport on the evaluation of an optimized germline mutation call-ing pipeline, HummingBird, by assessing its performance againstthe widely accepted BWA-GATK pipeline. We found that theHummingBird pipeline can signiﬁcantly reduce the running timeof the primary data analysis for whole genome sequencing andwhole exome sequencing while without signiﬁcantly sacriﬁcingthe variant calling accuracy. Thus, we conclude that expansionof such software usage will help to improve the primary dataanalysis efﬁciency for next generation sequencing.

Index Terms —Next generation sequencing, variant calling,germline mutation, genomics

I. I

NTRODUCTION

Next generation sequencing has revolutionarily changed theway of basic life science research and clinical practice forcancer, infectious and genetic diseases [1], [2]. Populationbased large scale sequencing at national level helped to collectbaseline variation information for various genetic traits fromhealthy and susceptible individuals [3]–[6]. With rapid in-crease of the output from new sequencing technologies and thegradual decrease of per base sequencing cost, whole genomesequencing and whole exome sequencing have migrated fromnational and multi-national research projects to daily practicalclinical applications especially in the ﬁeld of cancer andgenetic disease diagnostics [5], [7], [8]. As it is the inheritedcharacter for next generation sequencing, the initial depositionof large amount sequencing data provided not only immediateassistances needed for its requisition, but also a rich dataresource for further exploration.BWA is a widely accepted tool for aligning sequencingreads to a reference genome [9], [10]. In particularly, BWAconsists of three different algorithms: BWA-backtrack, BWA-SW and BWA-MEM. BWA-MEM is the optimized algorithmfor short reads mapping. Even though BWA implementsthread-level parallelism, it takes a long time for large-scaledatasets. To solve this problem, many researchers used dif-ferent strategies such as Hadoop, Spark, MPI to accelerateBWA [11]–[13], it is still hard to reach the goal that one usesacceptable hardware resources.

Corresponding authors: Weixing Ji([email protected]) and XinjunHe([email protected]).

The Genome Analysis Toolkit (GATK) is a set of bioinfor-matic tools for analyzing high-throughput sequencing (HTS)and variant call format (VCF) data [14], [15]. The toolkitis well established for germline short variant discovery fromwhole genome and exome sequencing data. In this study,we used GATK 4.1.2.0 as the benchmark to evaluate theperformance of HummingBird pipeline.elPrep is a high-performance tool for preparing sequenc-ing reads for variant calling in a sequencing pipeline [16]–[18]. It contains optimized sequencing reads preparation stepsincluding reads sorting, duplicates marking and base qualityrecalibration. elPrep prepares sequencing reads and producesidentical results as SAMtools [19]–[21] and Picard [22]. elPrepcan be run in multithread and entirely in memory, thusavoiding repeated ﬁle I/O and signiﬁcantly reducing runningtime for sequencing reads preparation.The power of next generation sequencing is to collect asmuch genetic information as possible from sequenced samples;the large amount of data generated thus require computa-tionally intensive method by professional bioinformaticians toextract the genetic information initially looked for. However,it usually takes up to a few days to process a whole genomesequencing data and costs unacceptably long time when thereare a large number of samples with standard BWA [9], [10]for alignment and the Genome Analysis Toolkit (GATK) fromBroad Institute for short variant calling for primary dataanalysis [14], [15], [23]. Thus, more reliable, efﬁcient andreproducible sequencing data analysis tools are needed to meetthe needs of a speedy variant calling [24], [25].In this paper, we developed HummingBird pipeline whichcan reduce the clock time for germline variant calling withoutsigniﬁcantly sacriﬁcing the variant calling accuracy. Here weevaluated a germline mutation calling pipeline, HummingBirdpipeline, that was developed by YiduCloud (Beijing) company,and reported that HummingBird is a fast and accurate mutationcalling pipeline that can be expanded for sequencing dataanalysis.The rest of the paper is organized in this way: Section 2compares BWA-GATK and HummingBird in methodology;Section 3 analyzes the results of BWA-GATK and Humming-Bird under different sample sets; the conclusions are given inSection 4.1 a r X i v : . [ q - b i o . GN ] A p r ABLE II

NFORMATION OF DATASETS USED IN THIS STUDY

Sample Name Type Clean Reads Files Clean Reads

II. E

VALUATION M ETHODOLOGY

HummingBird pipeline is such an endeavor developed toreduce the clock time for germline variant calling by opti-mizing the standard BWA-GATK software interaction withmaximizing the usage of hardware. As in commonly acceptedBWA-GATK pipeline, the HummingBird pipeline is builtwith a modiﬁed BWA alignment algorithm (HB-BWA) butpreserving the underlying mathematical model. The mappedsequence reads in bam ﬁles are further piped into elPrepsoftware [16]–[18], a tool kit written in GO, that outputssorted, duplication marked and recalibrated sequence reads,which is next processed with an optimized, C++ based GATKHaplotypeCaller (HB-HaplotypeCaller) for variant calling.We assessed the efﬁciency of HummingBird pipeline byusing a serial publicly available datasets. The datasets usedin this paper included 12 whole genome/exome sequencingdata ﬁles from NCBI/EBI database (Table 1). Table 2 showsthe download links for the datasets. The raw reads of thesedatasets were processed with fastp [26] software to removelow quality reads. Figure 1 presents the major steps ofHummingBird and BWA-GATK pipelines. The benchmarkingBWA-GATK procedure ﬁrst maps the sequencing reads byBWA MEM 0.7.17 to human hg19 reference genome, andthen with SAMtools 1.9 for indexing, GATK 4.1.2.0 Sort-Sam, MarkDuplicates, BaseRecalibrator and ApplyBQSR toperform sequencing reads sorting, duplicates marking and baserecalibration. After the sequence reads have been prepared,GATK 4.1.2.0 HaplotypeCaller was used for variant calling.Both pipelines were run on a x86_64 standalone server withIntel(R) Xeon(R) Gold 6254 CPU*2 @ 3.10GHz, 72 threads,256GB memory and a 1.5TB SSD hard drive.

Fig. 1. Major steps of BWA-GATK pipeline and HummingBird pipeline.BWA-GATK pipeline was initiated by inputting two paired-end clean readsfastq ﬁles into BWA 0.7.17. Mapped reads in sam ﬁles were next sequen-tially processed by GATK SortSam, MarkDuplicates, BaseRecalibrator andapplyBQSR modules for sequencing reads prep. SAMtools index functionwas used to index variant intermediate ﬁles. The HaplotypeCaller module ofGATK 4.1.2.0 was used in last step for variant calling. HummingBird pipelineuses the same clean reads ﬁles for its modiﬁed HB-BWA, which is pipeswith elPrep to sorting, duplicates marking and base recalibration. Preparedsequencing reads in bam ﬁles were further processed by the modiﬁed HB-HaplotypeCaller for variant calling.

III. E

VALUATION R ESULTS

A. Running time comparison

To evaluate the performance of HummingBird pipeline, weﬁrst run the pipeline along with the standard BWA-GATKprocedure on the same standalone server. We evaluated theoverall running time for 12 datasets downloaded from publicdatabases. Figure 2 compares the total time cost of Humming-Bird pipeline and standard BWA-GATK pipeline. We found2

ABLE IID

OWNLOAD ADDRESS OF DATASETS USED IN THIS STUDY

Sample Name Raw Reads Download AddressNIST7035_L001 ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001_trimmed.fastq.gzftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R2_001_trimmed.fastq.gzNIST7035_L002 ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L002_R1_001_trimmed.fastq.gzftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L002_R2_001_trimmed.fastq.gzNIST7086_L001 ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7086_CGTACTAG_L001_R1_001_trimmed.fastq.gzftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7086_CGTACTAG_L001_R2_001_trimmed.fastq.gzNIST7086_L002 ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7086_CGTACTAG_L002_R1_001_trimmed.fastq.gzftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7086_CGTACTAG_L002_R2_001_trimmed.fastq.gzSRR098401 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098401/SRR098401_1.fastq.gzftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098401/SRR098401_2.fastq.gzSRR742200 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR742/SRR742200/SRR742200_1.fastq.gzftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR742/SRR742200/SRR742200_2.fastq.gzSRR098359 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098359/SRR098359_1.fastq.gzftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098359/SRR098359_2.fastq.gzSRR768308 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR768/SRR768308/SRR768308_1.fastq.gzftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR768/SRR768308/SRR768308_2.fastq.gzERR091571 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_1.fastq.gzftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_2.fastq.gzNA12878_1000G ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/NA12878/sequence_read/NA12878_NIST ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/ERR262997 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR262/ERR262997/ERR262997_1.fastq.gzftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR262/ERR262997/ERR262997_2.fastq.gz that, overall, the HummingBird pipeline processed germlinemutation calling much more efﬁcient than the standard BWA-GATK pipeline, with a maximal speedup of 12.7x for thewhole genome sequencing dataset SRR768308 and 4.6x forthe whole exome sequencing dataset SRR098359.

Fig. 2. Total time cost of HummingBird pipeline compared withStandard BWA-GATK pipeline. Standard BWA-GATK and HummingBirdpipeline were run along each other for each dataset indicated. DatasetsNIST7035_L001, NIST7035_L002, NIST7086_L001, NIST7086_L002,SRR098401, SRR742200, SRR098359 are from WES sequencing, anddatasets SRR768308, ERR091571, NA12878_1000G, NA12878_NIST,ERR262997 are from WGS sequencing.

To further assess the efﬁciency of individual steps alongthe pipe, we looked into the alignment and variant callingsteps respectively as elPrep has been shown to efﬁcientlyprep mapped reads for GATK HaplotypeCaller. The timecost is shown in Figure 3A and 3B respectively. We foundthat HB-BWA and HB-HaplotypeCaller performed relativelymore efﬁcient than the standard BWA MEM and GATKHaplotypeCaller, with the maximal speedup of 9.4x in BWAalignment step for dataset ERR262997 and 36.9x in haplotypecaller step for dataset ERR091571.Overall, we concluded that, with the help from elPrep, theHB-BWA and HB-HaplotypeCaller within the HummingBirdpipeline are more efﬁcient for processing germline variantcalling than the standard BWA-GATK procedure.

B. Variant calling accuracy

Recall reﬂects the percentage of variants obtained by thestandard BWA-GATK benchmark pipeline that were calledby HummingBird pipeline in each dataset. Precision is thepercentage of called variants which match variants obtained bythe standard BWA-GATK benchmark pipeline. F1-score is theharmonic mean of recall and precision, as a combined metricfor evaluating overall accuracy. With the effort to assessmentthe accuracy of HummingBird pipeline, we calculated theresult of precision and recall and further obtained F1-score3 ig. 3. Time cost of alignment and haplotype caller steps of two pipelines.Running time for alignment (A) and haplotype caller (B) were recordedrespectively within the complete pipelines and plotted based on duration foreach dataset. on each test by the following formula: F core = precision × recallprecision + recall × (1)All these results are shown in Figure 4. It can be seenthat the precision, recall and F1-score on these data samplesare consistently high across different datasets, except fordataset ERR262997, a dataset generated for large insert libraryevaluation, with an F1-score of 0.87.We also looked into the performance of HummingBirdpipeline on the gold standard sample NA12878, one of the most studied samples with a better knowledge of its variantsby using the hap.py evaluation tool at the high conﬁdenceintervals on chromosome 1-22, X and Y developed by IlluminaInc (19). Results on dataset NA12878-NIST showed thatHummingBird pipeline generated the comparable accuracyscore with standard BWA-GATK pipeline for both INDEL andSNP (Figure 5, additional results are shown in Table 3). Fig. 5. Accuracy comparison between HummingBird and BWA-GATK onthe sample NA12878. Using the well-studied NA12878-NIST as the standarddataset, HummingBird and BWA-GATK pipeline were assessed against goldstandard Illumina trustset variants (v2017-1.0)(19). INDEL and SNP werefurther evaluated separately for Recall, Precision and F1-Score.

C. Variant calling consistency

To further assess the repeatability of the variant callingresults, we conducted the triplicated variant calling runs ofHummingBird pipeline on 10 samples, and the correspondingF1-scores are shown in Figure 6. results showed that Hum-mingBird has a stable variant calling ability.IV. D

ISCUSSION AND F UTURE D IRECTIONS

In this study, we validated the efﬁciency of HummingBirdgermline variant calling pipeline by comparing it with thewidely accepted BWA-GATK pipeline. High speedup andcomparable accuracy support the superiority of HummingBird

Fig. 4. Accuracy of HummingBird pipeline on each sequencing. Recall, Precision and F1-Score were calculated as described for each dataset indicated. ig. 6. Consistency of variant call by HummingBird pipeline. HummingBird pipeline were run in triplicates, accuracy metrics of the triplicated variant callingruns of HummingBird pipeline were evaluated on selected samples.TABLE IIIA CCURACY METRICS OF H UMMING B IRD TRIPLICATES RUN data set Repeats Recall Precision F1-ScoreNIST7035_L001 Repeat1 0.9842 0.8926 0.9362Repeat2 0.9842 0.8927 0.9362Repeat3 0.9842 0.8927 0.9362NIST7035_L002 Repeat1 0.9845 0.8943 0.9372Repeat2 0.9844 0.8943 0.9372Repeat3 0.9844 0.8943 0.9372NIST7086_L001 Repeat1 0.9837 0.8951 0.9373Repeat2 0.9837 0.8952 0.9374Repeat3 0.9837 0.8951 0.9373NIST7086_L002 Repeat1 0.9858 0.8928 0.9370Repeat2 0.9858 0.8928 0.9370Repeat3 0.9858 0.8928 0.9370SRR098359 Repeat1 0.9845 0.9012 0.9410Repeat2 0.9845 0.9011 0.9410Repeat3 0.9845 0.9011 0.9410SRR742200 Repeat1 0.9870 0.9218 0.9533Repeat2 0.9870 0.9218 0.9533Repeat3 0.9870 0.9218 0.9533SRR098401 Repeat1 0.9827 0.8944 0.9365Repeat2 0.9828 0.8944 0.9365Repeat3 0.9828 0.8943 0.9365ERR262997 Repeat1 0.8857 0.8733 0.8794Repeat2 0.8857 0.8732 0.8794Repeat3 0.8857 0.8733 0.8794NA12878_1000G Repeat1 0.9624 0.8938 0.9269Repeat2 0.9624 0.8939 0.9269Repeat3 0.9624 0.8938 0.9269ERR091571 Repeat1 0.9817 0.9233 0.9516Repeat2 0.9817 0.9234 0.9516Repeat3 0.9817 0.9233 0.9516 than BWA-GATK pipeline. Thus, it provides an alternativeway of germline mutation variant calling to meet the needs ofthe community.The optimization to germline mutation variant calling onstandalone machine is important but not adequate. In thisstudy, we did not test its performance in a cluster com- putation environment, although the pipeline can be readilyimplemented. With the rapid increase of human genomicsdata, we need to pay more effort to implement and optimizeHummingBird pipeline on distributed clusters and thus bearlarger scale computing.A

CKNOWLEDGEMENTS

We thank the individuals who have made their data publiclyavailable. We also would like to thank the public sequenc-ing projects made the genomic data available for differentresearchers to explore. We thank Wei Yue, Lei Ye, JinpengJiang, Yueyan Zhao, Jianhua Gao, Shiyu Han and ZhaonianTan for their inputs during this manuscript preparation.R

EFERENCES[1] K. M. Boycott, M. R. Vanstone, D. E. Bulman, A. E. MacKenzie, Rare-disease genetics in the era of next-generation sequencing: discovery totranslation. Nat Rev Genet 14, 681-691 (2013).[2] R. B. Altman et al., A research roadmap for next-generation sequencinginformatics. Sci Transl Med 8, 335ps310 (2016).[3] UK Biobank data on 500,000 people paves way to precision medicine.Nature 562, 163-164 (2018).[4] N. Cox, UK Biobank shares the promise of big data. Nature 562, 194-195 (2018).[5] M. J. Khoury et al., From public health genomics to precision publichealth: a 20-year journey. Genet Med 20, 574-582 (2018).[6] A. Auton et al., A global reference for human genetic variation. Nature526, 68-74 (2015).[7] G. R. Abecasis et al., A map of human genome variation frompopulation-scale sequencing. Nature 467, 1061-1073 (2010).[8] G. R. Abecasis et al., An integrated map of genetic variation from 1,092human genomes. Nature 491, 56-65 (2012).[9] H. Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760 (2009).[10] H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589-595 (2010).[11] L. Pireddu, S. Leo, G. Zanetti, SEAL: a distributed short read mappingand duplicate removal tool. Bioinformatics 27, 2159-2160 (2011).[12] J. M. Abuín, J. C. Pichel, T. F. Pena, J. Amigo, SparkBWA: SpeedingUp the Alignment of High-Throughput DNA Sequencing Data. PLoSOne 11, e0155461 (2016).

13] D. Peters, X. Luo, K. Qiu, P. Liang, Speeding Up Large-Scale NextGeneration Sequencing Data Analysis with pBWA. J Appl BioinformComput Biol 1, (2012).[14] A. McKenna et al., The Genome Analysis Toolkit: a MapReduce frame-work for analyzing next-generation DNA sequencing data. Genome Res20, 1297-1303 (2010).[15] G. A. Van der Auwera et al., From FastQ data to high conﬁdence variantcalls: the Genome Analysis Toolkit best practices pipeline. Curr ProtocBioinformatics 43, 11.10.11-11.10.33 (2013).[16] C. Herzeel, P. Costanza, D. Decap, J. Fostier, J. Reumers, elPrep: High-Performance Preparation of Sequence Alignment/Map Files for VariantCalling. PLoS One 10, e0132868 (2015).[17] P. Costanza, C. Herzeel, W. Verachtert, Comparing Ease of Programmingin C++, Go, and Java for Implementing a Next-Generation SequencingTool. Evol Bioinform Online 15, 1176934319869015 (2019).[18] C. Herzeel, P. Costanza, D. Decap, J. Fostier, W. Verachtert, elPrep 4: Amultithreaded framework for sequence analysis. PLoS One 14, e0209523(2019).[19] M. A. Eberle et al., A reference data set of 5.4 million phased humanvariants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res 27, 157-164 (2017).[20] H. Li et al., The Sequence Alignment/Map format and SAMtools.Bioinformatics 25, 2078-2079 (2009).[21] H. Li, A statistical framework for SNP calling, mutation discovery,association mapping and population genetical parameter estimation fromsequencing data. Bioinformatics 27, 2987-2993 (2011).[22] B. Institute. (2019), pp. http://broadinstitute.github.io/picard/.[23] M. Yi et al., Performance comparison of SNP detection tools with illu-mina exome sequencing data–an assessment using both family pedigreeinformation and sample-matched SNP array data. Nucleic Acids Res 42,e101 (2014).[24] N. Kathiresan et al., Accelerating next generation sequencing dataanalysis with system level optimizations. Sci Rep 7, 9058 (2017).[25] M. Plüss et al., Need for speed in accurate whole-genome data anal-ysis: GENALICE MAP challenges BWA/GATK more than PEMap-per/PECaller and Isaac. Proc Natl Acad Sci U S A 114, E8320-E8322(2017).[26] S. Chen, Y. Zhou, Y. Chen, J. Gu, fastp: an ultra-fast all-in-one FASTQpreprocessor. Bioinformatics 34, i884-i890 (2018).13] D. Peters, X. Luo, K. Qiu, P. Liang, Speeding Up Large-Scale NextGeneration Sequencing Data Analysis with pBWA. J Appl BioinformComput Biol 1, (2012).[14] A. McKenna et al., The Genome Analysis Toolkit: a MapReduce frame-work for analyzing next-generation DNA sequencing data. Genome Res20, 1297-1303 (2010).[15] G. A. Van der Auwera et al., From FastQ data to high conﬁdence variantcalls: the Genome Analysis Toolkit best practices pipeline. Curr ProtocBioinformatics 43, 11.10.11-11.10.33 (2013).[16] C. Herzeel, P. Costanza, D. Decap, J. Fostier, J. Reumers, elPrep: High-Performance Preparation of Sequence Alignment/Map Files for VariantCalling. PLoS One 10, e0132868 (2015).[17] P. Costanza, C. Herzeel, W. Verachtert, Comparing Ease of Programmingin C++, Go, and Java for Implementing a Next-Generation SequencingTool. Evol Bioinform Online 15, 1176934319869015 (2019).[18] C. Herzeel, P. Costanza, D. Decap, J. Fostier, W. Verachtert, elPrep 4: Amultithreaded framework for sequence analysis. PLoS One 14, e0209523(2019).[19] M. A. Eberle et al., A reference data set of 5.4 million phased humanvariants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res 27, 157-164 (2017).[20] H. Li et al., The Sequence Alignment/Map format and SAMtools.Bioinformatics 25, 2078-2079 (2009).[21] H. Li, A statistical framework for SNP calling, mutation discovery,association mapping and population genetical parameter estimation fromsequencing data. Bioinformatics 27, 2987-2993 (2011).[22] B. Institute. (2019), pp. http://broadinstitute.github.io/picard/.[23] M. Yi et al., Performance comparison of SNP detection tools with illu-mina exome sequencing data–an assessment using both family pedigreeinformation and sample-matched SNP array data. Nucleic Acids Res 42,e101 (2014).[24] N. Kathiresan et al., Accelerating next generation sequencing dataanalysis with system level optimizations. Sci Rep 7, 9058 (2017).[25] M. Plüss et al., Need for speed in accurate whole-genome data anal-ysis: GENALICE MAP challenges BWA/GATK more than PEMap-per/PECaller and Isaac. Proc Natl Acad Sci U S A 114, E8320-E8322(2017).[26] S. Chen, Y. Zhou, Y. Chen, J. Gu, fastp: an ultra-fast all-in-one FASTQpreprocessor. Bioinformatics 34, i884-i890 (2018).