Computer Architecture-Aware Optimisation of DNA Analysis Systems
Hasindu Gamaarachchi
A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy
School of Computer Science and Engineering
Faculty of Engineering
The University of New South Wales
November 2020

THE UNIVERSITY OF NEW SOUTH WALES
Thesis/Dissertation Sheet
Surname or Family name: Gamaarachchi
First name: Hasindu
Other name/s: Malshan
Abbreviation for degree as given in the University calendar: PhD
School: School of Computer Science and Engineering
Faculty: Faculty of Engineering
Title: Computer Architecture-Aware Optimisation of DNA Analysis Systems
Abstract
DNA sequencing—the process that converts chemically encoded data in DNA molecules into a computer-readable form—is revolutionising the field of medicine. DNA sequencers, the machines which perform DNA sequencing, have evolved from the size of a fridge to that of a mobile phone over the last two decades. The cost of sequencing a human genome has also reduced from billions of dollars to hundreds of dollars. Despite these improvements, DNA sequencers output hundreds or thousands of gigabytes of data that must be analysed on computers to discover meaningful information with biological implications. Unfortunately, analysis techniques have not kept pace with rapidly improving sequencing technologies. Consequently, even today, DNA analysis is performed on high-performance computers, just as it was a couple of decades ago. Such high-performance computers are not portable, and so the full utility of an ultra-portable sequencer for sequencing in-the-field or at the point-of-care is limited by the lack of portable, lightweight analytic techniques.

This thesis proposes computer architecture-aware optimisation of DNA analysis software. DNA analysis software is inevitably convoluted due to the complexity associated with biological data. Modern computer architectures are also complex. Performing architecture-aware optimisations requires the synergistic use of knowledge from both domains (i.e., DNA sequence analysis and computer architecture). This thesis aims to draw the two domains together. In this thesis, gold-standard DNA sequence analysis workflows (a workflow is a few software tools executed sequentially, where each software tool is a complex system of dozens of algorithms) are systematically examined for algorithmic components that cause performance bottlenecks. Identified bottlenecks are resolved through architecture-aware optimisations at different levels, i.e., the memory, cache, register and processor levels. The optimised software tools are used in complete end-to-end analysis workflows and their efficacy is demonstrated by running them on prototypical embedded systems. The embedded systems are not only fully functional, but their performance is also comparable to that of an unoptimised workflow on a high-performance computer. Such low-cost, energy-efficient, sufficiently fast and portable embedded systems enable complete DNA analysis at the point-of-care or in-the-field.
Declaration relating to disposition of project thesis/dissertation
I hereby grant the University of New South Wales or its agents a non-exclusive licence to archive and to make available (including to members of the public) my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known. I acknowledge that I retain all intellectual property rights which subsist in my thesis or dissertation, such as copyright and patent rights, subject to applicable law. I also retain the right to use all or part of my thesis or dissertation in future works (such as articles or books).

For any substantial portions of copyright material used in this thesis, written permission for use has been obtained, or the copyright material is removed from the final public version of the thesis.

Signature
Hasindu Gamaarachchi
Witness Date
17 November, 2020

FOR OFFICE USE ONLY
Date of completion of requirements for Award

Originality Statement
I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.
Hasindu Gamaarachchi
17 November, 2020

Copyright Statement
I hereby grant the University of New South Wales or its agents a non-exclusive licence to archive and to make available (including to members of the public) my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known. I acknowledge that I retain all intellectual property rights which subsist in my thesis or dissertation, such as copyright and patent rights, subject to applicable law. I also retain the right to use all or part of my thesis or dissertation in future works (such as articles or books).

For any substantial portions of copyright material used in this thesis, written permission for use has been obtained, or the copyright material is removed from the final public version of the thesis.
Hasindu Gamaarachchi
17 November, 2020
Authenticity Statement
I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis.
Hasindu Gamaarachchi
17 November, 2020

Abstract
DNA sequencing—the process that converts the massive amount of chemically encoded data in DNA molecules into a computer-readable form—is revolutionising the field of medicine through a variety of applications such as precision medicine, accurate diagnostics and identifying disease predisposition. DNA sequencing also has many other applications in areas such as epidemiology, forensics and evolutionary biology. DNA sequencers, the machines which perform DNA sequencing, have evolved from the size of a fridge to that of a mobile phone over the last two decades. The cost of sequencing a complete human genome has remarkably reduced from billions of dollars to hundreds of dollars over this time. The size of a DNA sequencer is expected to become even smaller and the sequencing cost per genome is expected to be even more affordable in the future. Thus, DNA tests are likely to be performed as routinely and cost-effectively as today's blood tests. Despite the reduction in size and cost, DNA sequencers output hundreds or thousands of gigabytes of data, necessary to account for errors made during the sequencing process. This data must be analysed on computers to discover meaningful information (for instance, mutations and epigenetic modifications) that has biological implications. Unfortunately, analysis techniques have not kept pace with rapidly improving sequencing technologies. Consequently, even today, DNA analysis is performed on high-performance computers, just as it was a couple of decades ago. Such high-performance computers are not portable, unlike mobile phone-sized ultra-portable sequencers. Consequently, the full utility of an ultra-portable sequencer for sequencing in-the-field or at the point-of-care is limited by the lack of portable, lightweight analytic techniques.

A primary reason for this lag between the two technologies is that sequence analysis software tools, written by computational biologists with a focus on higher accuracy of the results, are not optimised to efficiently utilise computational resources (i.e. the software does not map well to the architecture of computers). This thesis proposes computer architecture-aware optimisation of DNA analysis software. DNA analysis software is inevitably convoluted due to the complexity associated with biological data. Modern computer architectures are also complex. Performing architecture-aware optimisations requires the synergistic use of knowledge from both domains (i.e., DNA sequence analysis and computer architecture). Computer architecture knowledge helps the efficient mapping and exploitation of existing hardware resources, while the understanding of DNA sequence analysis ensures that the final accuracy of the results is intact. In a nutshell, this thesis aims to draw the two domains together.

In this thesis, gold-standard DNA sequence analysis workflows (a workflow is a few software tools executed sequentially, where each software tool is a complex system of dozens of algorithms) are systematically examined for algorithmic components that cause performance bottlenecks. Identified bottlenecks are resolved through architecture-aware optimisations at different levels, i.e., the memory, cache, register and processor levels. Some example optimisations are: (1) the cache-friendly optimisation of de Bruijn graph construction, a time-consuming core component in a class of software tools called variant callers (2X performance improvement); (2) memory capacity optimisation of reference indexes for the process called read alignment (from 16 GB down to 2 GB); (3) memory- and processor-level optimisation (for CPU-GPU heterogeneous systems) of an important time-consuming algorithm called adaptive banded event alignment, used in the latest nanopore sequencing technology (3-5X performance improvement). Instead of merely performing algorithmic optimisations, the optimised versions are integrated back into the software, and it is demonstrated that global efficiency is achieved while accuracy is unaffected. Finally, the optimised software tools are used in complete end-to-end analysis workflows and their efficacy is demonstrated by running them on prototypical embedded systems. The embedded systems are not only fully functional, but their performance is also comparable to that of an unoptimised workflow on a high-performance computer. The practicality of these embedded systems has been demonstrated by integrating them into the sequencing facility at the Garvan Institute of Medical Research in Sydney. Such low-cost, energy-efficient, sufficiently fast and portable embedded systems enable complete DNA analysis at the point-of-care or in-the-field. Work conducted under this thesis also contributes to the bioinformatics community through contributions to popular bioinformatics tools (i.e. Platypus, Minimap2 and Nanopolish) and the design and development of novel open-source bioinformatics software (f5c).

Acknowledgement
I wish to express my deepest gratitude to my supervisors
Prof Sri Parameswaran, Dr Martin A. Smith and Dr Aleksandar Ignjatovic for their amazing supervision. Their enthusiasm, encouragement, advice and attitude were so spectacular that I do not have enough words to describe them. Due to their great supervision, the time during my PhD was very productive, leading to significant outcomes, while also being enjoyable.

I am indebted to Hassaan Saadat, my fellow lab mate at UNSW, for his unwavering support, ingenious suggestions and encouragement. It was thanks to Hassaan that I participated in the ACM SRC, in which I eventually became a grand finalist. I am extremely grateful to James Ferguson, my fellow lab mate at the Garvan Institute, for countless insights and for generously sharing his unparalleled knowledge.

I am also grateful to
Dr Warren Kaplan and Prof John Mattick for identifying my talent and providing the opportunity to collaborate with the Garvan Institute of Medical Research, which was a valuable turning point in my PhD.

I would like to extend my sincere thanks to Arash Bayat and Vikkitharan Gnanasambandapillai, who were fellow PhD candidates at UNSW Sydney, and also Dr Bruno Gaeta at UNSW for my initial induction into the genomics field. I am also grateful to my progress review panel, who provided constructive advice and encouragement.

Many thanks to all current and former colleagues in the embedded systems research group at UNSW and the Genomics Technologies group at the Garvan Institute for the support provided on multiple occasions, especially Dr Darshana Jayasinghe, Dr Jorgen Peddersen, Hsu-Kang Dow, Dr Tuo Li, Shaun Carswell and Dr Ira Deveson. Thanks should also go to the Data-Intensive Computer Engineering (DICE) group at the Garvan Institute for helpful advice and practical suggestions.

I would like to express my deepest appreciation to Dr Roshan Ragel, my undergraduate-project supervisor, who played a decisive role in my selecting Prof Sri Parameswaran as my PhD supervisor and who also provided a great amount of assistance during the whole PhD application process. I would like to extend my sincere thanks to all the lecturers at the Department of Computer Engineering of the University of Peradeniya for setting a solid foundation for my career.

I gratefully acknowledge the assistance from
Dr Heng Li, the author of Minimap2, and Dr Jared Simpson, the author of Nanopolish, in understanding the code and for providing valuable insights and suggestions.

I very much appreciate the invaluable contributions to the software repositories from undergraduate students: Chun Wai Lam (UNSW), Gihan Jayatilaka (University of Peradeniya), Hiruna Samarakoon (University of Peradeniya) and Thomas Daniell (UNSW).

Last but not least, I thank my parents, my brother, relatives, all my former teachers and all my friends.

Funding: I acknowledge the UNSW Tuition Fee Scholarship and UNSW conference funding (Postgraduate Research Student Support and CSE HDR Student Travel). I would also like to acknowledge the travel bursaries from Oxford Nanopore Technologies and the ACM SRC Travel Award. I also appreciate the NVIDIA Corporation for donating the Jetson TX2 and Tesla K40 GPUs used for experiments in this thesis.

I would have been more grateful had UNSW offered me a prestigious IPRS scholarship as the Australian National University did. The appraisal of Prof Sri Parameswaran by his former students for his astounding supervision was the major factor in selecting UNSW, a choice I do not regret.

Publications and Presentations
List of Publications
This thesis has led to the following first-author journal publications, which are included in lieu of chapters.

• H. Gamaarachchi, A. Bayat, B. Gaeta, and S. Parameswaran, “Cache Friendly Optimisation of de Bruijn Graph based Local Re-assembly in Variant Calling,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018. DOI: https://doi.org/10.1109/TCBB.2018.2881975
• H. Gamaarachchi, S. Parameswaran, and M. A. Smith, “Featherweight long read alignment using partitioned reference indexes,” Scientific Reports 9, 4318 (2019). DOI: https://doi.org/10.1038/s41598-019-40739-8
• H. Gamaarachchi, C. W. Lam, G. Jayatilaka, H. Samarakoon, J. T. Simpson, M. A. Smith, and S. Parameswaran, “GPU Accelerated Adaptive Banded Event Alignment for Rapid Comparative Nanopore Signal Analysis,” BMC Bioinformatics 21, 343 (2020). DOI: https://doi.org/10.1186/s12859-020-03697-x
• H. Gamaarachchi, H. Saadat, S. Parameswaran, “Optimisation of Nanopore Sequence Analysis Software for Many-core CPUs”, prepared for submission [in progress], 2020.

This thesis has also led to the following article in the ACM SRC Grand Finals.

• “ESWEEK: G: Real-time, Portable and Lightweight Nanopore DNA Sequence Analysis using System-on-Chip”, ACM SRC Grand Finals, 2020. URL: https://src.acm.org/binaries/content/assets/src/2020/hasindu-gamaarachchi.pdf — third place Grand Finalist. Also available as a pre-print in bioRxiv, 2019. DOI: https://doi.org/10.1101/756122

Collaborative research conducted in close relation to the work presented under this thesis has led to the following publications and pre-prints, which are not included in the thesis.

• H. Samarakoon, S. Punchihewa, A. Senanayake, J. M. Hammond, I. Stevanovski, J. M. Ferguson, R. Ragel, H. Gamaarachchi and I. W. Deveson, “Genopo: a nanopore sequencing analysis toolkit for portable Android devices,” Communications Biology 3, 538 (2020). DOI: https://doi.org/10.1038/s42003-020-01270-z
• R. P. Mohanty, H. Gamaarachchi, A. Lambert, and S. Parameswaran, “SWARAM: Portable Energy and Cost Efficient Embedded System for Genomic Processing,” ACM Transactions on Embedded Computing Systems (TECS) 18.5s (2019). DOI: https://doi.org/10.1145/3358211
• A. Bayat, H. Gamaarachchi, … DOI: https://doi.org/10.20944/preprints202006.0324.v1
• A. F. Laguna, H. Gamaarachchi, X. Yin, M. Niemier, S. Parameswaran and X. S. Hu, “Seed-and-Vote based In-Memory Accelerator for DNA Read Mapping”, ICCAD 2020 [accepted]
List of Presentations
Oral presentations:

• “Performance Optimisation of Nanopore DNA Analysis Software: A Computer Architecture Aware Approach,” Australasian Leadership Computing Symposium (ALCS), 2019. URL: https://opus.nci.org.au/display/Help/Genomics+Stream?preview=/48497246/50236166/ALCS_Genomics_Gamaarachchi_released.pdf
• “Lightweight, Portable and Real-time Embedded Computing Systems for Downstream Nanopore Data Processing”, London Calling, 2020. URL:
• “Real-time, Portable and Lightweight Nanopore DNA Sequence Analysis using System-on-Chip”, ACM SRC second round at ESWEEK, 2019. — First place and entry into the SRC Grand Finals

Poster presentations:

• “Portable Real-time Genomic Data Processing: Harmonising Bioinformatics Software to Exploit Hardware”, Australasian Genomic Technologies Association Conference (AGTA), 2019. — Best student poster award
• “Real-time, Portable and Lightweight Nanopore DNA Sequence Analysis using System-on-Chip”, ACM SRC first round at ESWEEK, 2019. — Entry into the second round

Awards

The work conducted under this thesis has received the following awards.

• Grand Finalist (third place winner), ACM SRC Grand Finals graduate category, 2020
• First Place, ACM SIGBED SRC at ESWEEK, 2019
• Best Student Poster Award, Australasian Genomic Technologies Association Conference (AGTA), 2019
• Runner-up, UNSW 3 Minute Thesis School level, 2019

Contents
Abstract iii
Acknowledgement v
Publications and Presentations vii
Awards x
Contents xi
List of Figures xxi
List of Tables xxvii

1 Introduction 1
…
f5c compared with original Nanopolish 184
5.6 Discussion 186
5.7 Summary 188
A Appendix: Featherweight Long Read Alignment 258
A.1 Supplementary Note 1 259
A.1.1 Serialising (dumping) of the internal state 259
A.1.2 Merging operation 260
A.1.3 Emulated single reference index 261
A.2 Chromosome Balancing 262
A.2.1 Memory efficiency for references with unbalanced lengths 262
A.3 Supplementary Note 3 - Instructions to run the tools 264
A.3.1 Example 264
A.3.2 Index construction with chromosome size balancing 265
A.3.3 Running Minimap2 on a partitioned index with merging 266
B Appendix: f5c Documentation 267
B.1 Readme 267
B.1.1 Quick start 268
B.1.2 Building 268
B.1.3 Usage 270
B.1.4 Acknowledgement 271
B.2 Building f5c 271
B.2.1 Method 1 (recommended) 272
B.2.2 Method 2 (time consuming) 272
B.2.3 Method 3 (not recommended) 273
B.2.4 Docker Image 273
B.2.5 CUDA Troubleshooting 274
B.2.6 Compiling Issues 274
B.2.7 Runtime Errors 276
B.2.8 Commands and options 277
B.2.9 Available f5c tools 277
C Appendix: f5c
C.1 Why Nanopolish had to be re-engineered? 283
C.2 Additional advantages of f5c over Nanopolish 284
D Appendix: Portable Binaries 286
D.1 Key points 287
D.2 A case study with f5c 289
D.2.1 Note on CUDA libraries 290
D.2.2 Example commands 291
E Appendix: Rock64-cluster and f5p
E.1 Rock64-cluster 293
E.1.1 Required Hardware 293
E.1.2 Connecting nodes together 294
E.1.3 Setting up the head node 295
E.1.4 Compiling software and preparing the folder structure 296
E.1.5 Setting up worker nodes 297
E.2 f5p 298
E.2.1 Pre-requisites 298
E.2.2 Getting started 298
E.2.3 Building and initial configuration 298
E.2.4 Running for a dataset 299
F Appendix: Bioinformatics on Mobile Phone 301
F.1 Requirements 302
F.2 Steps 303
F.3 Examples 305
F.3.1 minimap2 305
F.3.2 Samtools 308
F.3.3 F5C 310
F.3.4 Nanopolish 312
F.4 Running Directly on Phone 313
F.4.1 On Android 7.0 or before 313
F.4.2 On Android 8.x 314
F.4.3 Is there a proper way? 315
G Appendix: Open-source Contributions 319
G.0.1 User comments for f5c 319
G.0.2 Contributions to Minimap2 321
G.0.3 Contributions to Nanopolish 324
H Appendix: I/O Optimisations 328
H.1 Extended Motivational Example on another System 328
H.2 Extended Bottleneck using Another Dataset 331
H.3 Extended Results 332
H.3.1 Results from Alternate File Format (SLOW5) for Another Dataset 332
H.3.2 Impact of Proposed Solutions on Disk IOPS 333
I Appendix: Poster Presentations 337
References 340

List of Figures

…
2.2 An example of FASTA file format 23
2.3 Elaboration of SNV and Indels 25
2.4 An example of VCF file format 26
2.5 Elaboration of the concept of coverage in sequencing 28
2.6 An example of FASTQ file format 30
2.7 First-generation sequencers 31
2.8 Illumina second-generation sequencers 33
2.9 Pacific Biosciences Sequel Sequencer 35
2.10 Nanopore third-generation sequencers 37
2.11 ONT MinIT, PromethION compute tower and MinION Mk1C 38
2.12 Simplified second-generation workflow 39
2.13 GATK Best Practices pipeline 40
2.14 Simplified illustration of aligned sequence reads to a reference 42
2.15 Dynamic programming based sequence alignment 58
2.16 Simplified elaboration of variant calling 59
2.17 A screenshot from IGV from an NA12878 dataset 59
2.18 Simplified third-generation nanopore workflow 60
2.19 Evolution of dynamic programming-based sequence alignment 61
2.20 Read length distribution nanopore consortium 62
2.21 A screenshot from IGV from an NA12878 nanopore dataset 63
3.1 Distribution of execution time for Platypus variant caller 68
3.2 A region of the reference and few mapped reads 70
3.3 De Bruijn graph for the region 71
3.4 The hash table 72
3.5 Elaboration of how mapping information is used for improving cache performance 76
3.6 Summary of the outcome of the proposed method 82
3.7 Distribution of memory accesses in the optimised implementation 83
3.8 Execution time for the baseline implementation and the modified implementation 86
4.1 Effect of parameters on memory usage, performance and accuracy 95
4.2 Effect of the window size parameter on the MAPQ distribution for synthetic spike-in controls 96
4.3 Effect of aligning sequences to single vs partitioned indexes 98
4.4 Effect of using partitioned indexes versus a single reference index on alignment quality 101
4.5 Features of alignments that were discordantly mapped in single-idx and … indexing strategies 104
4.6 Genome browser screenshots of alignments unique to the … that do not overlap annotated repeats 105
4.7 Distribution of MAPQ and DP alignment score for the disparate mappings (different by at least one base position) between single-idx and … 106
4.8 Scatter plot of alignment scores of single-idx vs … for disparate alignments (different by at least one base position) 107
4.9 Distribution of genomic features of disparate alignments (different by at least one base position) between single-idx and … 108
4.10 Statistics and genomic features of disparate mappings (mapping positions not overlapping at least by 10%) between single-idx and … 109
4.11 Alignment of an ultra-long Nanopore read from a chromothriptic region 110
4.12 Peak memory usage and runtime for a partitioned index of the GRCh38 human reference genome 111
4.13 Effect of the window size on the error rate and sensitivity for simulated reads 122
5.1 Nanopore portable sequencer and associated data analysis 128
5.2 Illustration of a nanopore raw signal, events and pore-model 134
5.3 Example nanopore read length distributions 134
5.4 Adaptive Banded Event Alignment 153
5.5 Thread configuration of pre-kernel 155
5.6 Thread assignment of pre-kernel 156
5.7 Utility of kcache in the core-kernel to improve memory coalescing 161
5.8 Decision trees for resource optimisation 174
5.9 Effect of individual optimisations 182
5.10 Speedup of ABEA on GPU compared to CPU 184
5.11 Comparison of f5c to Nanopolish 186
5.12 Human genome processing on-the-fly 187
6.1 Hardware architecture of the proposed system 192
6.2 Rock64-cluster placed alongside the nanopore sequencers at Garvan Institute of Medical Research 197
6.3 Construction of the Rock64-cluster 199
6.4 Methylation calling workflow and its software tools 200
6.5 Screenshot from Ganglia monitoring system 202
6.6 Screenshot from LogAnalyzer 203
6.7 NVIDIA Jetson development boards. Photograph credits: Hsu-Kang Dow 204
6.8 Comparison of Jetson TX2, Jetson Nano and Rock64 based on the single SBC execution times for the whole dataset 208
6.9 Execution time on individual SBCs per each batch in the dataset 209
6.10 The comparison of the sequencing rate with the data analysis rate over the duration of the sequencing run 211
6.11 Comparison of proposed architecture on the Rock64-cluster with the original pipeline running on an HPC 212
6.12 Methylation calling workflow on an Android mobile phone 214
6.13 Potential applications of real-time methylation calling 215
7.1 Variation of (a) execution time, (b) CPU utilisation and core-hours in original Nanopolish with the number of data processing threads 219
7.2 Elaboration of synchronous I/O 222
7.3 Elaboration of multi-threaded synchronous I/O 223
7.4 Elaboration of asynchronous I/O 224
7.5 Decomposition of time for individual components in restructured Nanopolish 228
7.6 Elaboration of the limitation in HDF library 231
7.7 The proposed multi-process based solution 233
7.8 Flow diagram depicting modifications to Nanopolish 235
7.9 Overall execution time and CPU utilisation when SLOW5 format is used 241
7.10 Comparison of FAST5 vs SLOW5 access 241
7.11 Overall results for multi-process pool 242
7.12 FAST5 file access using multiple I/O threads vs I/O processes 243
7.13 Single-FAST5 vs Multi-FAST5 using I/O threads 245
7.14 Single-FAST5 vs Multi-FAST5 using I/O processes 246
7.15 Performance on NFS 247
F.1 CPU and RAM usage 309
F.2 Executing Minimap2 using terminal emulator 316
F.3 Remote ADB 317
F.4 Execution using remote ADB 318
H.1 Variation of runtime of original Nanopolish with the number of threads 329
H.2 CPU utilisation with the runtime 330
H.3 Inefficiency of multi-threaded I/O to the HDF5 library 332
H.4 Statistics collected using collectl for Nanopolish 334
H.5 Statistics collected using collectl for SLOW5 335
H.6 Statistics collected using collectl for FAST5 multi-process pool 336
I.1 Poster presented at AGTA 2019 338
I.2 Poster presented at ACM SRC at ESWEEK 2019 339

List of Tables

…
… single-idx and … 112
4.4 Detailed runtime for partitioned indexes 121
5.1 Data arrays associated with ABEA and their sizes 164
5.2 GPU data arrays, pointer computation and heuristically determined sizes 165
5.3 Measured quantities 173
5.4 Adjustable user parameters 173
5.5 Information of the datasets 178
5.6 Different systems used for experiments 178
6.1 Execution results of several nanopore datasets of the human genome on the Rock64-cluster 206
7.1 Example of SLOW5 file format 236
7.2 Example of SLOW5 index 237
7.3 Dataset 238
7.4 Computer systems 239
H.1 An Oxford Nanopore PromethION dataset 332
H.2 System used for experiment in section H.1 332

Chapter 1
Introduction
Humankind technologically advanced to read or ‘sequence’ their own DNA—nature’s blueprint of a living organism—only a few decades ago. This technological advancement was a turning point for healthcare and medicine, and led to a new era of medicine that is more precise and tailored to the individual than traditional evidence-based medicine. Sequencing of the first-ever human DNA started as an international effort called the Human Genome Project in 1990 and took 13 years to complete in 2003 [1], at a cost of 2.7 billion USD [2]. Since then, DNA sequencing technologies have advanced at a remarkable pace, to the point where today the human genome can be sequenced in just two days at a cost of less than 1000 USD [3]. This rapid pace of advancement is continuing, and sequencing is becoming possible within a few hours [4] at a cost of less than 100 USD [5]. Thus, DNA sequencing tests are likely to become as routine as today’s blood tests in the near future.

Precision medicine considers the variability in genes, environment and lifestyle amongst different individuals to guide the use of effective and safe treatments tailored to a particular individual [6]. This is in contrast to the one-size-fits-all approach of traditional evidence-based medicine that targets the average person [6]. Clinical genomics, which considers the information encoded in the DNA, is being increasingly incorporated into precision medicine protocols [7]. One of the ten highest-grossing drugs (in the USA), rosuvastatin—a statin used to lower blood cholesterol—is shown to benefit only 1 in 20 patients [8]. The best benefit ratio for any of the top 10 grossing drugs is 1 in 4, which is still considerably low [8]. In traditional evidence-based medicine, selecting the most effective drug out of the available subtypes of drugs for a particular patient is a somewhat trial-and-error process, which is inefficient [9].
In precision medicine, genetic information gathered from sequencing individuals’ DNA is being increasingly used to determine the most effective drug. For instance, the latest anti-cancer drugs such as crizotinib, which treats anaplastic lymphoma kinase (ALK) positive lung cancer, already require genetic testing of the patient [10]. While genetic testing today is mainly performed for critical cases, it is expected to become more and more frequent in the future as the cost and availability of DNA sequencing improve rapidly.

Another use of DNA sequencing is for implementing a more proactive approach: "prevention is better than cure". The genetic information of an individual can determine the predisposition to a number of diseases, thus making it possible to implement preventive measures. A well-known example is the actress Angelina Jolie, who underwent a double mastectomy in 2013 after a genetic test revealed a significantly higher chance of developing breast cancer. While the cost of a genetic test in 2013 was probably not affordable for everyone, today such tests are becoming more and more realistic and common.

DNA sequencing is also beneficial in epidemiological applications. In the recent past, during the Ebola virus outbreak in West Africa (2013–2016) and the Zika virus outbreak in Brazil (2015–2016), DNA sequencing was utilised for viral surveillance. Today, the utility of DNA sequencing is more evident than ever before, due to the ongoing COVID-19 pandemic. Sequencing of the viral genome allows identification of mutations, which facilitates tracking of the disease spread and provides insights into the virus’s evolution that are useful in vaccine development.

In addition to the above, DNA sequencing is also applicable in several other fields such as forensics, evolutionary biology and agronomy.

DNA sequencing alone is of limited utility if not for sequence analysis, a very heavy computational analysis that follows the actual sequencing.
DNA sequencers—the machines that sequence the DNA—read the DNA sequence in small fragments called ‘reads’. Computational analyses must be performed to put these reads together to achieve a draft sequence assembly as close as possible to the original DNA sequence, or to compare them against an existing reference of the original DNA sequence. This two-step process, (a) DNA sequencing and (b) sequence analysis, is depicted in Fig. 1.1.

The input to the DNA sequencing process (Fig. 1.1) is a tissue sample of a living organism (e.g., blood). Such a sample contains billions of cells, and each and every cell contains a homologous copy of the DNA sequence (identical copies if the cells are normal, i.e., unless they are cancer cells) that is millions of molecular bases long. The full DNA sequence inside a single cell of a human is 3.1 billion bases long and, when printed, is a series of books that fills a whole bookshelf (Fig. 1.2). This long DNA sequence is tightly packed inside the cell, and the sample preparation process that unpacks it breaks the fragile DNA strand into small fragments (unintended fragmentation in nanopore sequencing, intended fragmentation in Illumina sequencing) (Fig. 1.1). This prepared sample, containing trillions of fragments of DNA from multiple cells, is read by an array of sensors in the DNA sequencer and output as a series of data points that represent the biological sequence (Fig. 1.1).

The reads output by the sequencer—tiny fragments coming from multiple copies of the full DNA sequence—are in random order. The sequence analysis process (Fig. 1.1) that assembles these tiny pieces to obtain the original DNA sequence, or compares differences in the reads against a reference of the original DNA sequence, is typically challenging and computationally intensive, mainly due to the following reasons:
Figure 1.1: DNA sequencing and sequence analysis

1. Reads are tiny compared to the original full DNA sequence (reads are around 75-500 bases in second-generation sequencers and around 1,000-100,000 bases in third-generation sequencers, where the full DNA sequence is typically millions of bases long).

2. The reference sequence used for comparison is somewhat different to the DNA sequence in a sample (around 0.5% difference between two humans due to genetic variation [12]), and the error introduced by sequencers when reading the DNA is comparatively large (around 0.1%-1% in second-generation sequencers and around 0.5%-13% in third-generation sequencers).

Figure 1.2: Human DNA sequence printed as a series of books displayed at Wellcome Collection, Euston Square, London. Photograph from [11], licensed under Creative Commons Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0).

3. The complexity of the human genome (i.e., more than 50% of the human DNA sequence consists of repeat regions [13, 14], and there are numerous types of repeat regions with distinct characteristics).

4. The large data volume output by the sequencers (can be as high as hundreds of gigabytes or even a few terabytes).

Over the last two decades, a plethora of workflows that perform DNA sequence analysis has emerged. A sequence analysis workflow is so sophisticated that a single workflow is a pipeline of different software tools run one after the other (Fig. 1.1), and each single software tool is a collection of numerous algorithms and heuristically determined parameters. Computational biologists and bioinformaticians have increasingly attempted to improve the accuracy of sequence analysis workflows over the years and, as a result, the workflows have become more sophisticated.
The rapid improvement of DNA sequencing technologies in terms of sequencing cost over the last two decades is depicted in Fig. 1.3. A hypothetical line depicting Moore’s law is included to show how fast sequencing technologies have improved in comparison. From 2001 to 2007, the cost of sequencing a human genome reduced at a rate similar to Moore’s law. Then, from 2007 to 2019, the drop in sequencing cost was faster than Moore’s law, from 10 million USD to 1000 USD per human genome. Illumina, a leading sequencing company, has announced that their upcoming technology will bring this cost down to 100 USD [5]. This rapid improvement is expected to continue, and sequencing machines that were once limited to high-end research facilities are slowly arriving in pathology labs.

In addition to the improvement of DNA sequencing in terms of cost, the physical size and weight of DNA sequencers have also improved. Fig. 1.4 shows the evolution of sequencing machines over the last two decades. The ABI Prism 3700 DNA sequencer, a first-generation sequencer released in 1999, was similar in size to a fridge (dimensions 134.62 cm x 76.2 cm x 74.93 cm) with a weight of over a hundred kilograms. The Illumina MiSeq sequencer, a second-generation sequencer released in 2011, was the size of an oven (dimensions 68.6 cm × …).
Figure 1.4: Evolution of sequencing machines: ~size of a fridge (ABI Prism 3700), ~size of an oven (Illumina MiSeq), ~size of a phone (ONT MinION), ~size of a USB drive (ONT SmidgION)

years as a result of biologists improving the accuracy of the results.

4. Sequence analysis software tools written by computational biologists or bioinformaticians with a focus on higher accuracy of the results are un-optimised to efficiently utilise computational resources, i.e., the software does not map well to the architecture of computers.
Sequence analysis steps consume days to weeks if done on a commodity laptop, or are not possible at all in certain cases due to the limited memory (RAM) of laptop computers. Hence, clusters of high-performance computers are currently being used, yet the process takes hours to complete. Such super-computers are very costly and massive, and are typically available only in high-end research facilities. As the data set sizes involved are massive (up to several terabytes), cloud computing that relies on the internet is not ideal.

The utility of an ultra-portable DNA sequencer such as the MinION is currently limited by the analysis process being performed on non-portable high-performance computers. There are a number of examples where scientists took MinION sequencers into the field to perform sequencing. During the 2013-2016 Ebola virus outbreak in West Africa, scientists performed in-the-field sequencing using MinION sequencers [16]. However, the sequence analysis had to be performed offsite on high-performance computers in Europe. Gigabytes of data were transferred through a mobile internet connection, which was expensive and slow. Technologies to perform the analysis in-the-field would have been valuable in such circumstances, not only to reduce the cost but also for a quick turn-around time of results.

Another similar example is the use of the MinION during the Zika virus outbreak in Brazil [17], where the utility was again limited by the available analysis technologies. Going beyond rural areas, scientists have performed sequencing using the MinION in jungles, in the Arctic [18] and even on the International Space Station [19], all of which would have benefited from efficient sequence analysis technologies. Currently, the ultra-portability of the MinION sequencer is being used to facilitate sequencing of SARS-CoV-2 in smaller decentralised laboratories around the world [20].
Rapid epidemiological data sharing from places all over the world is key to a better public health response. Ultra-portable analysis technologies will be of further benefit in such circumstances.

Better analysis technologies will benefit not only portable applications such as the above, but also the decentralisation of DNA tests in the future. DNA sequencers that are limited to high-end research facilities today will soon arrive in pathology labs and even doctors’ offices. Better sequence analysis technologies will support processing the data in situ, without the need to transfer data to centralised high-performance computing facilities. They can also benefit large-scale sequencing studies where the processing is performed in centralised high-performance computing facilities, by reducing the computing cost and the turnaround time of results.

Considering all of the above factors, improving analysis technologies to match DNA sequencing technologies is a timely need.
Possible solutions to fill the gap between sequencing and analysis technologies are explored below, in the context of the four reasons stated in the latter part of section 1.1.

The performance of sequencers has evolved faster than that of computers, which used to follow Moore’s law (Fig. 1.3). General-purpose single-core processor performance improved by only 3% in the year 2017—much slower than Moore’s law—and future improvements should focus on application-specific hardware, as pointed out by the pioneers in computer architecture John Hennessy and David Patterson, who received the Turing Award in 2018 [21]. Application-Specific Integrated Circuits (ASICs), custom circuit chips designed specifically for sequence analysis, would be a potential solution to reduce the gap, as has already been done in other domains such as digital signal processing. However, with the field of genomics still immature and the workflows rapidly evolving, designing custom hardware is challenging. Even small changes in algorithms require redesigning the custom hardware, and this design cost is millions of dollars. An option that provides better flexibility than ASICs would be to design Application-Specific Instruction-set Processors (ASIPs), which sit between general-purpose processors and ASICs in terms of flexibility. However, fabricating such ASIPs would still incur billions of dollars. Field Programmable Gate Arrays (FPGAs) could be used as ‘breadboards’ to prototype ASICs or ASIPs; however, full sequence analysis workflows are too complicated to be made fully functional on a typical FPGA with limited resources.

The high data volume output by the sequencers is a cause of the increased amount of computation, yet it is beneficial for accounting for errors introduced by the sequencers, i.e., errors can be normalised when one region of a DNA string is covered by multiple reads.
If sequencers become more and more accurate, the amount of data required to assemble a single DNA sequence will reduce. However, the production of such accurate sensors that function at nano- and pico-scale measurements is far in the future.

The complexity of the human genome is inevitable, i.e., there are seven categories of repeated sequences, each category has subcategories, each subcategory has a number of different families, each family has subfamilies, and these subfamilies have distinct characteristics [22]. Also, more than 50% of the human genome is composed of repeated sequences [13, 14]. Certain regions (e.g., telomeres, centromeres) in the human genome have been intractable to existing technologies and are still being resolved at the time of writing [23]. Analysis workflows that work on such complex genomes are thus inevitably sophisticated. What is meant by sophisticated here is not the algorithmic time-complexity, but rather the number of idiosyncratic cases that deviate from the general model. When processing, each of these deviating cases must be handled separately. For instance, each family of repeated sequences may need to be processed using different algorithms and/or heuristic parameters, leading to a large number of code paths. A sequence analysis workflow as a whole will thus remain sophisticated; however, the time-complexity of each and every algorithm inside the workflow can be improved. Designing better algorithms with lower time complexity has been, and will be, one of the most effective ways to improve performance. Over the past decades, plenty of work has been done on designing better algorithms, and this will continue.

Sequence analysis software tools are typically designed and developed by computational biologists or bioinformaticians whose major focus is to develop methods that are predicated on answering a research question or producing a specific outcome.
Typically, those computational biologists or bioinformaticians have access to near-unlimited computational resources in their research environments—clusters of high-performance servers with hundreds of gigabytes of RAM. Their focus is not on maximal optimisation of the software, which requires detailed knowledge of computational systems. Consequently, sequence analysis software tools are typically severely un-optimised for efficiently running on computing systems with limited resources such as laptops or desktops. In other words, sequence analysis software tools severely lack computer architecture-aware optimisations that exploit knowledge of the underlying hardware architecture. Note that these architecture-aware optimisations are not to be confused with algorithmic time-complexity optimisations, which have already been done to a considerably adequate level in current sequence analysis software. Consider a hash table versus a contiguous array in memory. Despite accessing the hash table and the array having the same time-complexity, contiguous accesses to an array are tens of times faster than random accesses to a hash table in a modern computer, due to the presence of memory caches. Such optimisations that map existing sequence analysis software components to efficiently exploit the complex architectural features of modern computer systems are henceforth referred to as computer architecture-aware optimisations.

Out of the solutions discussed above, the most timely is to perform architecture-aware optimisations on existing sequence analysis software. Such optimisations are cost-effective and practical, yet rarely applied to sequence analysis software. The focus of this thesis is such architecture-aware optimisations of existing DNA sequence analysis software. In addition to providing efficient performance on general-purpose computers, such optimisations would complementarily benefit any future-focused ASIP design efforts.
The architecture of modern computer systems is complex. Understanding such complex architectures requires knowledge of a number of topics, such as:

• the memory hierarchy (Fig. 1.5);
• interfacing between the processor, memory and co-processors (Fig. 1.5);
• internal details of processors and co-processors, such as multiple cores, instruction scheduling and the instruction set architecture; and,
• low-level software such as the operating system (processes, threads, scheduling, disk caches, virtual memory, etc.) and device drivers.

Simultaneously, the field of DNA analysis is also utterly complex, and understanding such analysis tools requires knowledge of a number of topics, such as:

• basic molecular biology involving the structure and function of DNA, chromosomes, the genome, etc.;
• features of DNA sequences, such as various types of repetitive sequences;
• different generations of DNA sequencing technologies;
• characteristics of data produced by sequencing technologies, such as read lengths and error rates; and,
• sequence analysis algorithms such as sequence alignment, variant calling and methylation calling.

Developing efficient software that conforms to hardware architectures requires knowledge of all these computer architecture related topics. Developing sequence analysis software that produces accurate results requires knowledge of all the above DNA analysis related topics. Thus, architecture-aware optimisation of sequence analysis software requires the simultaneous use of knowledge from both computer architecture and DNA analysis (Fig. 1.5).

DNA sequence analysis software tools are sophisticated, and modern computer architectures are also sophisticated. Computer architecture knowledge helps to efficiently utilise resources. At the same time, knowledge of the characteristics of biological data and the associated algorithms ensures that the accuracy of the final results is unaffected. The domain knowledge from both fields is utilised for the optimisations (Fig. 1.5). This thesis attempts to bridge the two interdisciplinary fields—computer architecture and DNA analysis—to produce sequence analysis software that can efficiently utilise the existing resources in a modern computer system.
This thesis is about architecture-aware optimisations to existing DNA sequence analysis software. We present architecture-aware optimisations at different levels: processor level, register level, cache level, RAM level and disk I/O level. The outline of the rest of the thesis is as follows.

In chapter 2, the background required to understand the technical chapters and a detailed literature review of the state-of-the-art are given. First, the background of DNA and DNA sequencing is presented in a simplified fashion for a reader from a non-biological background. Then, the background and the state-of-the-art of sequence analysis workflows are presented. Finally, previous efforts in architecture-aware optimisation of DNA analysis workflows are presented.

In chapter 3, cache and register level optimisations to a popular variant calling software called
Platypus are presented. A major time-consuming component of this software—de Bruijn graph construction—was improved using cache and register level optimisations, without any impact on the accuracy.

In chapter 4, memory (RAM) size optimisation of a popular sequence alignment software called Minimap2 is presented. The peak memory usage in Minimap2 was reduced through a divide-and-conquer strategy, most importantly without compromising the accuracy. This work enabled performing sequence analysis on low-memory systems such as mobile phones, which was otherwise not possible.

Chapter 5 discusses RAM level, cache level and processor level optimisations to a core algorithmic component in analysing data produced by Oxford Nanopore sequencers, called Adaptive Banded Event Alignment (used in the popular nanopore analysis toolkit Nanopolish). This includes how the algorithm was parallelised for CPU-GPU heterogeneous architectures. Importantly, the impact on the accuracy of the final results is negligible.

Chapter 6 discusses how the optimisations proposed in chapters 3, 4 and 5 were integrated to develop fully functional embedded system prototypes for a popular nanopore sequence analysis workflow. It is shown that the performance of the prototype embedded systems employing the proposed optimisations is surprisingly comparable to that of the same workflow (unoptimised version) run on a high-performance server.

Then, going beyond embedded systems, chapter 7 presents the identification of the primary bottleneck in nanopore sequence analysis workflows, one that seriously affects even high-performance servers. Solutions for alleviating this bottleneck are also presented.

Finally, the thesis is concluded in chapter 8 with a discussion of future directions.
Chapter 3 is published in IEEE/ACM Transactions on Computational Biology and Bioinformatics [24]. Chapter 4 is published in Nature Scientific Reports [25] and has received global attention amongst the community (Altmetric score 89, picked up by 8 news outlets). Chapter 5 is available as a pre-print on bioRxiv [26], and a modified version has been accepted for publication in BMC Bioinformatics. Chapter 6 may be adapted for publication in the future. Chapter 7 is prepared to be submitted to an IEEE/ACM journal or conference proceedings.

In addition, collaborative research conducted in close relation to the work presented in this thesis has led to second-author publications and pre-prints [27–29]. However, none of the content from these second-author publications is claimed as a part of this thesis.
This thesis benefits the community through a number of contributions to existing open-source bioinformatics software and the development of new open-source bioinformatics software. The existing open-source software tools contributed to are the Platypus variant caller (see chapter 3), the popular sequence aligner Minimap2 (see chapter 4, appendix A and appendix G) and the popular nanopore signal analysis toolkit Nanopolish (see appendix G). The new bioinformatics software developed under this thesis are f5c (see chapter 5, appendix B and appendix C) and f5p (see chapter 6 and appendix E). Also, the design and the associated software for the prototype embedded systems are released as open-source (see chapter 6 and appendix E).
The research conducted in support of this thesis won third place in the grand final of the ACM Student Research Competition (SRC) 2020, amongst competitors from 22 major ACM conferences. Entry to the ACM SRC grand final was gained by winning first place in the ACM SRC at the ESWEEK 2019 conference. Research conducted under this thesis also received the best poster award at the Australasian Genomic Technologies Association Conference 2019.
Figure 1.5: Synergistic use of knowledge from both computer architecture and biology
Chapter 2
Literature Review
In this chapter, the background of DNA sequencing is discussed in section 2.1. Then, the background of sequence analysis and the associated data structures and algorithms are discussed in section 2.2. In section 2.3, related work that has focused on computational optimisation of sequence analysis software is discussed.
In this subsection, the terminology and basic concepts of DNA sequencing are introduced.
Deoxyribonucleic Acid (DNA) is the blueprint of life. DNA is a molecule that encodes the structure and the function of a living organism [30]. A close analogy from computer science is a computer program. A computer program is composed of instructions and data to achieve a particular outcome, whereas DNA is composed of instructions and data to make a living organism from scratch and to maintain its function. Instructions and data in a computer program are encoded in binary (base-2), whereas instructions and data in DNA are encoded in quaternary (base-4). The four bases of the DNA alphabet are Adenine (A), Cytosine (C), Guanine (G) and Thymine (T), which are molecules called nucleotides.

A long chain of nucleotide bases connected through chemical bonds forms a
DNA strand. Two such strands coiled around each other form the double-helix-shaped DNA molecule (Fig. 2.1). Both strands contain the same information, and having two such strands facilitates DNA replication, the process by which DNA copies itself during cell division. The two strands are complementary to each other and are held together by hydrogen bonds between G-C and A-T base pairs, i.e., a base ‘G’ is complementary to base ‘C’ (and vice versa) and a base ‘A’ is complementary to base ‘T’ (and vice versa).
A DNA molecule is tightly coiled many times and packaged with proteins to form a structure called a chromosome. Inside the nucleus of every cell of a human being, there are 23 pairs of such chromosomes (Fig. 2.1). The chromosome pairs are named chromosome 1 to chromosome 22, and the 23rd chromosome pair determines the sex. This 23rd chromosome pair contains two X chromosomes in females, and an X chromosome and a Y chromosome in males. In humans, the largest chromosome is chromosome 1 (∼247 million bases) and the smallest is chromosome 21 (∼47 million bases). In each chromosome pair, one chromosome is inherited from the mother and the other from the father.

The number of chromosomes varies amongst organisms: it can be just one chromosome or even thousands of chromosomes. The ploidy—whether the chromosomes exist in pairs, singly or in a higher number of sets—also varies amongst different organisms.
The complete nucleotide sequence of all the chromosomes within a cell is called the genome. The size of a genome is measured in nucleotide bases. Following the metric prefixes, thousands, millions and billions of bases are called kilobases, megabases and gigabases, respectively. Biologists typically use kb, Mb and Gb as symbols for these units, but this thesis uses the symbols kbases, Mbases and Gbases to prevent any confusion with kilobytes, megabytes and gigabytes.

The genome sizes of various organisms are listed in Table 2.1. Viral genomes are typically the smallest, with sizes of several thousand bases. Bacterial genomes can vary from several hundred thousand bases to a few million bases (∼100 kbases to ∼15 Mbases). Insect genomes are in the order of hundreds of millions of bases (∼100 Mbases to ∼900 Mbases). The genome size of complex organisms is billions of bases. For instance, the human genome is 3.1 Gbases (6.2 Gbases if both chromosomes in a pair are considered). The largest genome found so far is that of a rare Japanese flowering plant called Paris japonica and is 149 Gbases long.

The widely used file format for storing a genome on a computer is the
FASTA ( .fa ) format. FASTA is a simple text-based format where the characters ‘A’, ‘C’, ‘G’ and ‘T’ denote thenucleotide bases. An example genome stored in the
FASTA format is given in Fig. 2.2. Theline that starts with a ‘>’ character contains the name of the chromosome (may contain othermetadata) and the subsequent lines contain the actual DNA sequence (Fig. 2.2).20.1. DNA SEQUENCING
Figure 2.1: The DNA molecule and a chromosome. Chromosomes are present inside the nucleus of a cell of an organism. The DNA molecule that is the major component of a chromosome is double-stranded. A DNA strand is composed of four nucleotide bases A, C, G and T, depicted in blue, yellow, red and green, respectively. The figure is from https://commons.wikimedia.org/wiki/File:Eukaryote_DNA.svg licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Table 2.1: Genome size of various organisms
Organism                        Genome size
HIV (virus)                     10 kbases
H1N1 (virus)                    14 kbases
SARS-CoV-2 (COVID-19 virus)     30 kbases
Helicobacter pylori (bacteria)  1.7 Mbases
Escherichia coli (bacteria)     4.6 Mbases
Yeast                           12.1 Mbases
Fruit fly (insect)              140 Mbases
Mouse                           2.5 Gbases
Cow                             3 Gbases
Human                           3.1 Gbases
Wheat                           17 Gbases
Marbled lungfish                130 Gbases
Paris japonica (plant)          149 Gbases

To save space,
FASTA can be compressed using the extended gzip format called bgzip, which allows random access to genomic locations in the compressed file at the expense of a slightly lower compression ratio than gzip.

A representative example of the genome of a particular species is known as the reference genome. Species such as humans have a high-quality reference as a result of the human genome project, which produced a draft assembly that was subsequently improved by scientists over the years. The latest version of the human genome is named GRCh38 (Genome Reference Consortium Human Build 38).
Repeats are quite common in genomes; for instance, more than 50% of the human genome is composed of repeats [13, 14].
Repeats are also known by terms such as repeated sequences, repetitive elements or repeat regions. Repeats have always introduced complications to the sequence analysis process, since reads coming from such regions are non-unique and thus extremely challenging to accurately align [14]. Repeats have been classified into several types based on the characteristics of the sequence, for instance, satellite repeats, simple repeats, tandem repeats, transposons, etc. [22, 31].

>chr1
GCCCTGGGTGTGACTCTGGGGGTGCAGGCTCCTCCCACCCACAGAGAGCCCCCCCACATGCATGGGTGTCCTGGGGATGCTGGTGGTCAGGGGTCAGTGGCCTGGGCAGGCTGGGGAAGCCTGGCCCTCCCATAGCCTGCTGTGGACAATCAGGAAGCCCCAAGCTTGGGGGCAGCCTCGCCCGCAGCCACCGGGGACTCCTGGGTGTGTGTTCCGCTCGCCTCTGCCGCGTGTCTGTCCCTTTCTCTGCCGTGTCTGCTGTGCATCTGGCCCTTCTCCTGTGTTCTCTCTTCCTCCACC
…………………………………………………………………………………………………………………
>chr2
ACCCATATATATACATATACACACATATACATACATACACACACAGCTTGGTTACAATGC
ATATTTTTGTTTCTTTGCTTAGTAAAAGATCTACCACATTGTACATAACAAATAGACATT
TCTACTGTTCGTTGATATGAAATAACTGTAAAAAACTTAATTGTCCTTACACTTTGTGTTTAGATGTGGCAAGTAGCAAGAGACTGTAGTAACCACTGTAAACCATGACTACACATAGATAAACTCTCAGATCATAGTTCTTTAAAATCTATGCAAGAGCTTTCTAAAAAAGAAGCATAC
…………………………………………………………………………………………………………………

Figure 2.2: An example of the FASTA file format. The dotted lines indicate that the long chromosome sequences continue, i.e., the dotted lines are not actually present in a FASTA file. Note that the sequences here are hypothetical and are not representative of a particular species.
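The FASTA structure described above (a ‘>’ header line followed by sequence lines) can be read with a few lines of code. The following is a minimal sketch (the parser and the example records are illustrative, not part of the thesis):

```python
# Minimal FASTA parser sketch: header lines start with '>' (chromosome
# name plus optional metadata); subsequent lines hold the sequence.
def parse_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []  # name is first token after '>'
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

fasta = [">chr1 example", "GCCCTGGGTGTGACT", "CTGGGG", ">chr2", "ACCCATAT"]
genome = dict(parse_fasta(fasta))
print(genome)  # -> {'chr1': 'GCCCTGGGTGTGACTCTGGGG', 'chr2': 'ACCCATAT'}
```

Real tools use indexed access (e.g., via bgzip compression, as noted above) rather than reading the whole file into memory.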
The exact interpretation of the genome is not fully understood yet. However, scientists have interpreted the genome to a considerable extent. Numerous regions in the DNA called genes are individually or collectively responsible for features and functions of a living organism.
About 99.5% of the genome of all humans is the same [12]. The 0.5% difference encompasses human genetic variation. The differences in the genome of a particular individual relative to the reference genome of that species are called variants.

Different types of variants exist. A variation of a single nucleotide base is called a Single Nucleotide Variation (SNV). An SNV that is prevalent amongst a sufficiently large fraction of the population is referred to as a Single Nucleotide Polymorphism (SNP). An insertion or deletion of one or more contiguous bases is called an indel [32]. Examples of these three types of variants (SNVs, insertions and deletions) are shown in Fig. 2.3. Indels can be as small as one or two bases or as large as ten thousand bases. Variants that are 50 or more bases long (50 is the typical value and this threshold is somewhat arbitrary) are known as structural variants [33]. Structural variants include many different sub-types such as long insertions, long deletions, copy number variants and inversions.

Most of these variants cause natural differences among individuals. However, some variants are responsible for various diseases. For instance, diseases such as sickle cell anaemia
(a) an example of an SNV
DNA sequence of the reference:  C T C G A T G C G C C T A C G
DNA sequence of the sample:     C T C G A T G A G C C T A C G

(b) an example of a single base insertion
DNA sequence of the reference:  C T C G A T G C G C C T A C G
DNA sequence of the sample:     C T C G A T G A C G C C T A C G

(c) an example of a single base deletion
DNA sequence of the reference:  C T C G A T G C G C C T A C G
DNA sequence of the sample:     C T C G A T G G C C T A C G

Figure 2.3: Elaboration of SNV and Indels
[34] and beta-thalassemia [35] are directly associated with SNVs. Diseases such as asthma and allergic rhinitis are caused by a complex contribution of both genetic and environmental factors [36]. A large number of genetic variants that contribute to various diseases have been discovered. ClinVar is a public database containing such medically significant genetic variants [37]. More and more novel variants and their correspondence to various conditions are being discovered.

Detected variants are typically stored in the file format called Variant Call Format (VCF) [38], which is a text-based format exemplified in Fig. 2.4. The header contains lines starting with the ‘#’ character, and each subsequent line records a single variant.

Figure 2.4: An example of VCF file format
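The variant classes above (SNVs, insertions, deletions, structural variants) can be distinguished from the reference (REF) and alternative (ALT) alleles of a VCF-style record. The following sketch assumes simple biallelic records and the typical 50-base structural-variant threshold mentioned above; it is illustrative, not a production variant classifier:

```python
# Classify a variant from VCF-style REF and ALT alleles:
# equal single bases -> SNV; a length difference -> insertion/deletion;
# a difference of 50+ bases -> structural variant (typical threshold).
def classify_variant(ref: str, alt: str) -> str:
    diff = abs(len(alt) - len(ref))
    if diff >= 50:
        return "structural variant"
    if len(ref) == len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"

print(classify_variant("C", "A"))    # -> SNV (as in Fig. 2.3a)
print(classify_variant("G", "GA"))   # -> insertion
print(classify_variant("GC", "G"))   # -> deletion
```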
Nucleotide bases in the DNA can naturally undergo biochemical modifications when chemical compounds are attached to them. Such nucleotide bases with additional chemical compounds attached are known as modified bases. To date, more than 17 different types of base modifications have been identified in DNA [39]. The set of base modifications undergone by every base in the genome, when taken together, is called the epigenome. A common type of base modification in humans is the addition of methyl groups to nucleotide bases, which is known as DNA methylation.

DNA methylation is known to be a regulator of the genome, i.e., the expression of genes can be regulated by base modifications. DNA methylation is also known to affect development and tissue differentiation. DNA methylation is altered by environmental factors, and those alterations can be passed on to the next generations.
The DNA molecule exists inside a human cell in a very compact form (scale of nanometres), with the DNA strand coiled many times; if uncoiled, it would be a few metres long. To read the DNA strand, it has to be extracted from the cell and uncoiled. As the DNA strand is very fragile when uncoiled, reading the full DNA strand from one end to the other accurately is still a technological challenge. The best available technology today can only read this DNA strand in fragments of contiguous bases. This is due to the fragile DNA strand breaking into fragments at random locations during the DNA extraction from the cell, during uncoiling and even during the reading process. The process of reading the DNA sequence is termed
DNA sequencing [40] and the machines that perform this sequencing are called
DNA sequencers. The sequencing machine takes a tissue sample of an organism, for instance, blood (more accurately, a sample prepared from the tissue in which the DNA strands have been extracted), and outputs the order of the bases in a digital form. The DNA strands break into fragments (fragmentation can be intentional in certain technologies such as Illumina) and the sequencing machine reads these fragments of DNA strands. The resultant series of data points denoting the bases of a DNA fragment is called a read. The reads are output in random order by the sequencer. This is mainly due to the DNA fragments floating randomly in the liquid solution of the sample.

Original DNA sequence:
C T C T G G G G G T G C A G G
reads:
C T C T G
  T C T G G
  T C T G G
    C T G G G
      T G G G G
          G G G G T
            G G G T G
              G G T G C
                G T G C A
                G T G C A
                G T G C A
coverage = 4X (at the marked position)

Figure 2.5: Elaboration of the concept of coverage in sequencing.
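The per-position coverage illustrated in Fig. 2.5 can be computed directly from read placements. The following sketch uses hypothetical (start, length) pairs for the reads; it is for illustration only:

```python
# Coverage sketch: count how many reads overlap each position of a
# sequence, given the (start, length) placement of each read.
def coverage(seq_len, read_placements):
    depth = [0] * seq_len
    for start, length in read_placements:
        for pos in range(start, min(start + length, seq_len)):
            depth[pos] += 1  # this read covers position pos
    return depth

# Hypothetical placements of 5-base reads on a 15-base sequence.
reads = [(0, 5), (1, 5), (1, 5), (2, 5), (3, 5)]
print(coverage(15, reads))
```

A position's depth is simply the number of intervals containing it; real tools compute the same quantity from sorted alignments rather than from explicit placements.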
A sample prepared for sequencing contains fragments of DNA that originated from nearly identical DNA molecules (each cell has a copy of the DNA and there are millions of cells in a sample). Fragmentation of DNA occurs at random locations on the DNA strand. The sequencer randomly sequences a subset of these DNA fragments floating in the solution and outputs them as reads. Consequently, a single position in the genome is covered by multiple reads. The number of reads that overlap a particular position on the genome is known as the depth or coverage. Fig. 2.5 elaborates the coverage using an example. In the example, the coverage of the marked base position is 4× because that position on the original sequence is covered by four reads.

The process of converting direct or indirect measurements of the nucleotide bases (in the DNA strand) captured by sensors in the sequencer into ASCII reads is called base-calling. The base-calling process is not 100% accurate due to the presence of noise in measurements, sensor limitations and restrictions of the software involved in base-calling. One or more bases in a read can be incorrectly base-called, and such errors are known as base-calling errors or sequencing errors.

The volume of data output by a sequencer, or the sequencing yield, is typically measured using the total number of bases in all the reads generated during a sequencing run (the duration for which the sequencing machine is operated). Modern sequencers can generate reads that sum to billions of bases and thus the common unit used for yield is Gbases.

Base-called reads are typically stored in the file format called
FASTQ [41], a text-based file format extended from the previously discussed FASTA format. In FASTQ format, a single read takes four lines (Fig. 2.6): the first line is the read name (read identifier and optional metadata) that starts with an ‘@’ character; the second line is the actual read sequence in ACGT characters; the third line is always a ‘+’ character; and the fourth line is the per-base phred quality score encoded in ASCII (the phred quality score Q is given by Q = −10 log10(P), where P is the base-calling error probability [42]).

As of today, there have been three generations of DNA sequencing technologies. They are detailed below.

@read1
CTCGATGCGCCTACGTTCAGTTCACATGTTGCTGCTTTCGCATTTTATCGGTAGAGCACC
+
@read2
ATGTTTGTGGCGTTTCAGTTACGTGGCCTGTTTCCGCATTTATCGGTAGAAACTGCCTTT
+
$%%%'&&)&&'%)+)($*1($'&)1&'$%$&&&(*$(%,29+10/)**'*.()*)+'*11
@read3
TTGTTCGGATTTACCGTATTGCCTGTTTTCGCATTTTACTCATTGAGGAAGCGCTTTCTG
+

Figure 2.6: An example of FASTQ file format
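The four-line record structure and the phred formula above can be sketched in a few lines. The quality encoding assumed here is the common Sanger/Illumina 1.8+ scheme (ASCII value = Q + 33); the record used is taken from Fig. 2.6:

```python
# Sketch of reading FASTQ records (4 lines per read) and decoding
# per-base phred scores: Q = -10*log10(P), stored as ASCII(Q + 33)
# in the common Sanger/Illumina 1.8+ encoding (an assumption here).
def parse_fastq(lines):
    """Yield (name, sequence, phred_scores) for each 4-line FASTQ record."""
    it = iter(lines)
    for header in it:
        seq, _plus, qual = next(it), next(it), next(it)
        phred = [ord(c) - 33 for c in qual.strip()]
        yield header.strip()[1:], seq.strip(), phred

def error_probability(q):
    """Invert the phred formula: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

record = ["@read2", "ATGTT", "+", "$%%%'"]
name, seq, quals = next(parse_fastq(record))
print(quals)                  # '$' is ASCII 36, so its phred score is 3
print(error_probability(30))  # Q30 corresponds to a 1-in-1000 error probability
```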
Sanger et al. used a method called the plus-minus system to sequence the first complete DNA, that of a virus, in 1977 [43]. The introduction of the chain termination method (also known as Sanger sequencing) [44] was a breakthrough in sequencing technologies due to its accuracy and convenience. With various improvements to this method, automated DNA sequencers were produced that were capable of sequencing complex genomes. These automated Sanger sequencers are known as first-generation sequencing machines.

First-generation sequencers can produce high-quality (accurate) long-reads at the expense of high cost and low throughput. For instance, the Applied Biosystems 3730xl first-generation sequencer in Fig. 2.7 could output reads at around 99% accuracy and 400 to 900 bases in length. However, a single sequencing run that spans a duration of 20 minutes to 3 hours generates only 1.9-84 kbases [45]. In fact, the human genome project that started in 1990 mainly used first-generation sequencers [46]. The human genome project took 13 years to complete at an expense of billions of dollars. Today, first-generation sequencers are infrequently used.

Figure 2.7: First-generation sequencers. A row of Applied Biosystems 3730xl DNA Analyzer machines. Photo is from [47], licensed under CC BY 2.0. The weight of a machine is 180 kg and the dimensions are 100 cm (W) x 73 cm (D) x 89 cm (H) [48].
In 1985, a different technique to that used in first-generation sequencers was introduced [49], and the eventual improvements in the 1990s led to the second-generation sequencing technology. In the literature, the term next-generation sequencing has been used instead of second-generation sequencing, which is no longer appropriate due to the emergence of the third generation. Therefore, this thesis will consistently use the term second-generation sequencing.

Second-generation sequencers are capable of sequencing multiple DNA fragments (up to billions of fragments) in parallel and thus are also referred to using terms such as high-throughput sequencing or massively parallel sequencing. The sample preparation step for second-generation sequencing involves the intentional fragmentation of DNA strands into short pieces. The read lengths produced by second-generation sequencers are around 75-500 bases, and these reads are referred to as short-reads. Second-generation sequencing has enabled sequencing complete genomes at an extremely low cost and at a much faster rate when compared to first-generation sequencers. For instance, the Illumina X Ten sequencer was the first to achieve whole-genome sequencing (WGS) for 1000 USD in less than 3 days [50].

Illumina has become the dominant company in the production of second-generation sequencers. Illumina machines have an error rate of around 0.1%-1% per base sequenced [51]. Fig. 2.8 depicts two different Illumina sequencing machines: the HiSeq 2500, used in large-scale sequencing centres, and the MiSeq, a relatively smaller benchtop device.

Second-generation sequencers are widely used at present. Due to the low cost of sequencing with good accuracy, second-generation sequencers are suitable for SNV and short indel detection. However, the primary limitation of second-generation sequencing is that variants occurring in repeat regions of the genome cannot be easily resolved.
This is because reads coming from such repeat regions usually align to multiple locations of the reference genome. Also, structural variants that are longer than the short-read lengths cannot be easily identified using second-generation sequencing.

Chapter 3 in this thesis is about software used to analyse second-generation sequencing data.
[Fig. 2.8 also tabulates read length, accuracy (~99.9% for both machines), sequencing yield per run, time per run, weight (HiSeq 2500: 312 kg; MiSeq: 93.6 kg) and the dimensions of the two machines.]

Figure 2.8: Illumina second-generation sequencers. Photograph of the HiSeq sequencer is from https://commons.wikimedia.org/wiki/File:Illumina_HiSeq_2500.jpg and the MiSeq is from https://en.wikipedia.org/wiki/File:Illumina_MiSeq_sequencer.jpg, both licensed under CC0 1.0. Note that the values for sequencing yield are to give a rough idea and may change based on a number of factors. The time of a sequencing run is given as a range since the exact value differs based on the read length configured for the sequencing run.
Sequencing approaches different from second-generation sequencing appeared in the late 2000s and eventually led to the third generation of sequencing technology [52]. Third-generation sequencers produce much longer reads with lower accuracy when compared to second-generation sequencers [53]. Reads produced by third-generation sequencers are known as long-reads. Similar to second-generation sequencers, third-generation sequencers are also capable of sequencing thousands of reads in parallel and thus fall under the category of high-throughput sequencers. Currently, two major companies produce third-generation sequencers: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Third-generation sequencing technologies are under active development and are not as mature as second-generation sequencers. The read lengths and the accuracy are continually improving with time, and the values given here are to give a rough idea.

PacBio uses a technology known as Single-Molecule Real-time Sequencing (SMRT). Fig. 2.9 depicts one of their sequencers, called Sequel. PacBio sequencers can produce reads under two distinct modes: Continuous Long-Reads (CLR) and Circular Consensus Sequencing (CCS) reads. CLR are much longer (up to around 40 kbases) at the expense of lower accuracy, whereas CCS reads are shorter but more accurate.

In nanopore sequencing, the disruptions to an ionic current caused by a DNA strand passing through a nanoscopic pore are measured. These current measurements are known as raw signals and are used during the base-calling process to deduce nucleotide sequences. Thus, nanopore sequencers are capable of directly measuring the actual DNA strand, unlike other sequencing

Figure 2.9: Pacific Biosciences Sequel Sequencer. Photograph from https://en.wikipedia.org/wiki/File:SequelSequencer.jpg licensed under CC BY-SA 4.0. Dimensions are 92.7 x 86.4 x 167.6 cm.
technologies (second-generation Illumina or third-generation PacBio) that perform sequencing by synthesis.

The average length of reads produced by nanopore sequencers is typically 10-20 kbases, and the exact value depends on the fragmentation during sample preparation and the library preparation protocol. Ultra-long-reads that are longer than 1 Mbases have been recorded. The accuracy of raw base-called reads of nanopore sequencers is given in Fig. 2.10.
                           MinION           GridION          PromethION
Read length                average 10 kbases, even up to >1 Mbases (all devices)
Sequencing yield per run   15 - 30 Gbases   75 - 150 Gbases  2.4 - 8.6 Tbases
Time per run               48 hours         48 hours         72 hours
Weight                     87 g             11 kg            28 kg
Figure 2.10: Nanopore third-generation sequencers. The photographs are from Nanopore, https://nanoporetech.com/about-us/for-the-media. The read length and sequencing yield values are to give a rough idea and may change based on a variety of different factors.

(a) MinION sequencer (right) connected to the MinIT base-calling unit (left); (b) PromethION sequencer (left) and its compute tower (right); (c) MinION Mk1C sequencer with integrated base-calling unit
Figure 2.11: ONT MinIT, PromethION compute tower and MinION Mk1C. The photographs are from Nanopore, https://nanoporetech.com/about-us/for-the-media.

Repetitive regions such as the centromere of each chromosome of the human genome reference were only resolved very recently using third-generation sequencing technology [23].

Unlike other technologies, nanopore sequencers can stream data in real-time, which facilitates data analysis on-the-fly (while the sequencer is operating). Also, nanopore sequencers such as the MinION are ultra-portable, and they are in harmony with the intention of this thesis to construct embedded systems for sequence analysis.
The goal of sequence analysis is either to assemble the reads into the actual DNA sequence in the sample (or the genome), or to compare differences in the reads against a reference genome (e.g., to detect variants or epigenetic modifications). The former is performed when a high-quality reference genome is not available and thus the assembly has to be performed from scratch (known as de novo assembly). The latter is performed when a high-quality reference genome is available (referred to as reference-guided sequence analysis). This thesis focuses on reference-guided sequence analysis. For well-known species like humans, scientists have spent years compiling a high-quality reference sequence. Therefore, for most practical purposes involving humans, reference-guided sequence analysis is adequate.

While reference-guided sequence analysis has some similarities between second-generation and third-generation sequencing, there are important differences. Section 2.2.1 describes the typical reference-guided sequence analysis workflow for second-generation sequencing and section 2.2.2 for third-generation sequencing. Although not required for the thesis, a brief account of de novo assembly is given in section 2.2.3 for the sake of completeness.
A simplified second-generation bioinformatics workflow is given in Fig. 2.12. Certain workflows may contain additional steps such as filtering and calibration (e.g., the GATK Best Practices pipeline from the Broad Institute in Fig. 2.13). However, the most important and computationally challenging steps are the ones shown in Fig. 2.12.

reads (FASTQ file) + reference genome (FASTA file)
        |  1. sequence alignment
        v
alignment records (SAM file)
        |  2. sorting
        v
sorted alignment records (SAM file)
        |  3. variant calling
        v
variants (VCF file)

Figure 2.12: Simplified second-generation workflow

The reads, typically in
FASTQ format (discussed previously in section 2.1.1.11), are first aligned to the reference genome (step one in Fig. 2.12). This process is known by various terms such as sequence alignment, read alignment or read mapping. The sequence alignment process produces the alignment records for every read (whether the read was successfully mapped, mapping coordinates, mapping quality, etc.) in a file format called the sequence alignment/map format (SAM) [57]. Tools and associated algorithms for sequence alignment are detailed in section 2.2.1.1.

Figure 2.13: GATK Best Practices pipeline. Image from https://gatk.broadinstitute.org

The alignment records in the SAM file are then sorted (step two in Fig. 2.12) based on genomic coordinates. That is, sorting first by chromosome order and then by base position in each chromosome. The sorted alignment records are typically stored in a file format called BAM, which is a binary version of the SAM format with BGZF compression support [57]. BAM allows random access to alignment records for a given genomic region through an index called the BAM index. The most popular tool for sorting is samtools [57], written in the C programming language, which is reasonably optimised for performance. Other tools such as Picard [58], written in the Java programming language, and
Sambamba [59], written in the D programming language, can also be used for sorting.

The next step is the identification of variants amongst sequencing errors and alignment artefacts, and this process is known as variant calling (step three in Fig. 2.12). The variant calling step takes the sorted alignment records (BAM file) and the reference genome (
FASTA file) and outputs the identified variants in the VCF file format (discussed previously in section 2.1.1.6). These variants can reveal important information about the individual, such as disease predisposition and drug response. However, variant calling is quite challenging as variants should be differentiated correctly from sequencing errors and alignment artefacts. Tools and associated algorithms for variant calling are detailed in section 2.2.1.2.
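The coordinate sorting step described above (first by chromosome order, then by base position) can be sketched as a simple sort key. The record layout and chromosome list here are illustrative assumptions; real tools such as samtools operate on SAM/BAM records:

```python
# Sketch of coordinate sorting: order alignment records first by
# chromosome, then by position within the chromosome.
# Records here are simplified (chrom, pos, read_name) tuples.
CHROM_ORDER = {name: i for i, name in enumerate(
    [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"])}

def sort_alignments(records):
    return sorted(records, key=lambda r: (CHROM_ORDER[r[0]], r[1]))

records = [("chr2", 500, "r1"), ("chr1", 900, "r2"),
           ("chr1", 100, "r3"), ("chrX", 5, "r4")]
print(sort_alignments(records))
```

Sorting by this composite key is what makes coordinate-based random access (e.g., through a BAM index) possible.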
Fig. 2.14 is a simplified elaboration of sequence alignment. Sixteen reads with read lengths of 8 bases have been aligned to the reference. The differences in the reads relative to the reference (due to sequencing errors or actual variants) have been shaded in grey. Note that only single-base mismatches appear in this demonstration, whereas in reality there can be insertions and deletions. The average number of reads that overlap a particular nucleotide position is called coverage. Terms such as depth and depth of coverage are also used interchangeably [60]. The required coverage depends on the application [60]; for instance, a coverage of 30X or more is recommended [61] for the detection of SNVs and indels.

To date, a large number of sequence alignment tools have been published [62]. Modern sequence alignment tools typically perform the alignment in two steps: first, potential mapping locations of a given read on the reference genome are searched using an index (e.g. a hash table); and second, the read is aligned at base-level to those potential locations in the reference using dynamic programming-based alignment algorithms to identify the optimal alignment.

Use of an index is required to reduce the search space in a large genome. Performing base-level alignment of a read onto the whole reference genome is impractical due to the computational and memory complexity when the reference genome is large. Locating a few locations on the
[Figure 2.14 shows the reference sequence C T G G C C C T C C C A T A G C C T G C T G T G G A C A A T C A G G A A G C C C C with sixteen 8-base reads aligned beneath it; bases differing from the reference are shaded.]

Figure 2.14: Simplified illustration of aligned sequence reads to a reference

reference genome (for instance, 5-10) using an index is thus vital. The two common indexing approaches use hash tables and the Burrows-Wheeler transform (BWT) [63].

Earlier short-read alignment tools used the hash table-based approach. The alignment tool
MAQ [64] builds a hash table out of the reads and iterates through the reference sequence to find potential mappings. In contrast, alignment tools such as
SOAP [65] and
BFAST [66] build the hash table using the reference genome and iterate through the reads to find potential mappings.

Modern short-read alignment tools typically rely on a BWT-based index called an FM-index [67]. An FM-index is constructed by taking the BWT of the reference genome, which effectively compresses the data while allowing sub-string indexing at the same time. The FM-index-based approach has gained popularity due to its superiority to hash tables in terms of both performance and memory footprint. Alignment tools such as BOWTIE [68, 69], BWA [70–72] and SOAP2 [73] use this approach.

After potential mapping locations are identified quickly using an index, more accurate base-level alignment algorithms are dispatched to find the optimal alignment. These algorithms to determine the optimal alignment between two biological sequences typically utilise dynamic programming (DP). The very first of such algorithms, the Needleman-Wunsch (NW) algorithm, dates back to the 1970s [74]. NW and its variant, the Smith-Waterman (SW) algorithm [75], are of quadratic time and space complexity. Both NW and SW were used extensively to perform fine alignment of DNA sequences with high quality. However, due to their extended time consumption, several heuristic improvements have been proposed by researchers to improve the speed of alignment without losing quality.

Fig. 2.15a exemplifies an original SW-based alignment (no heuristics) between two sequences, a target sequence t1t2...t6 (6 bases long) and a query sequence q1q2...q8 (8 bases long). The DP table (scoring matrix) contains 6x8 cells as shown.
First, the initial values are set (shown as 0 in the figure); second, the score for each cell (s(x,y)) is computed based on a scoring scheme; and third, the trace-back (backtracking, denoted by red arrows in the figure), starting from the highest-scoring cell and ending at a cell with a 0 score, outputs the optimal alignment that yields the highest score (please refer to [76] for a detailed explanation of SW).

In the case of short-read alignment, the sequences to be aligned are small (typically 75-500 bases). Two sequences (each ~100 bases long) can be aligned by filling ~10^4 cells. While a single such alignment can be quickly handled by a modern computer, it is computationally demanding when the number of alignments to be performed scales up to hundreds of millions and billions, which is the case for short-reads. To reduce the number of computations, banded alignment approaches were introduced [77], where only the cells in the DP table along the left diagonal band are computed, as shown in Fig. 2.15b. The underlying assumption is that the sequences aligned to each other are essentially similar, thus the alignment (the trace-back arrows) should lie close to the left diagonal. Note that in the figure, only the cells in a band of width (W) four have been computed. This computation is sufficient since the band contains the alignment.

X-drop in BLAST (Basic Local Alignment Search Tool) [78] is another notable heuristic for SW that terminates the computation when the drop in the alignment score reaches a threshold. An extended version of X-drop called Z-drop is used in the modern alignment tool BWA-MEM [72].

In addition to computing the alignment and the alignment score for each read, modern alignment tools also compute an important quantity called the Mapping Quality (MAPQ). The concept of mapping quality was introduced in the MAQ aligner [64].
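The SW scoring recurrence and the banded heuristic described for Fig. 2.15 can be sketched as follows. The scoring scheme (match=2, mismatch=-1, gap=-2) is an illustrative assumption, not the parameters used by any particular tool, and only the best score (not the trace-back) is returned:

```python
# Sketch of Smith-Waterman DP scoring with an optional band of width
# `band` around the diagonal (the banded heuristic skips distant cells).
def smith_waterman(target, query, band=None, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between target and query."""
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]  # first row/column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            if band is not None and abs(i - j) > band:
                continue  # banded heuristic: skip cells far from the diagonal
            s = match if query[i - 1] == target[j - 1] else mismatch
            # local alignment: scores never drop below 0
            H[i][j] = max(0, H[i-1][j-1] + s, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("CTCGATGCG", "CTCGATGAG"))          # full DP table
print(smith_waterman("CTCGATGCG", "CTCGATGAG", band=2))  # banded, same result here
```

When the true alignment stays within the band, the banded computation gives the same score while filling far fewer cells.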
MAPQ is computed per read as −10 log10(P), rounded off to the nearest integer, where P is the probability of the mapping position being incorrect. This probability value is heuristically determined through different formulas in different software but essentially considers both the alignment score and the number of sub-optimal mappings of the read. A higher number of sub-optimal mappings means that the read is likely to be from a repeat sequence and thus the chance of the mapping being incorrect is high. MAPQ is an important score for the variant calling step, i.e., to avoid false-positive variants.

Variant calling is the process of identifying the variants amongst sequencing errors and alignment artefacts. One of the simplest possible examples illustrating the variant calling process is shown in Fig. 2.16, which is based on the same reads and reference used in the previous example (Fig. 2.14). Note that in Fig. 2.16, the reads have been sorted based on genomic coordinates and the marked variant is called simply based on a majority vote. However, such a simple strategy will not be adequate for accurately identifying variants in real genomic data (to minimise both false positives and false negatives), and numerous sophisticated variant calling software tools have been introduced.

More than 40 open source tools have been released in the last decade [79]. Most tools utilise a probabilistic framework (the Bayesian approach is the most common); popular variant calling tools such as GATK UnifiedGenotyper [80], GATK HaplotypeCaller [81] (the successor of UnifiedGenotyper), FreeBayes [82], the SAMtools package (samtools and bcftools) [83] and Platypus [84] are some examples. In contrast to the probabilistic methods, certain tools such as VarScan rely on heuristic approaches [85, 86].

Past variant callers (e.g., GATK UnifiedGenotyper) solely relied on the read alignment performed by the aligning tool.
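The naive majority-vote strategy of Fig. 2.16 can be sketched as follows. The pileup representation (position mapped to the list of bases observed in overlapping reads) and the example data are illustrative assumptions, far simpler than what real variant callers do:

```python
# Sketch of majority-vote variant calling: at each position, the most
# frequent base among overlapping reads is compared to the reference base.
from collections import Counter

def majority_vote_variants(reference, pileup):
    """pileup maps position -> list of bases observed in reads there."""
    variants = []
    for pos, bases in sorted(pileup.items()):
        call, _count = Counter(bases).most_common(1)[0]
        if call != reference[pos]:
            variants.append((pos, reference[pos], call))
    return variants

ref = "CTGGCCCTCCC"
pileup = {3: ["G", "G", "G", "G"],   # agrees with the reference
          7: ["A", "A", "T", "A"]}   # majority 'A' differs from reference 'T'
print(majority_vote_variants(ref, pileup))  # -> [(7, 'T', 'A')]
```

As the text notes, this strategy ignores base qualities, mapping qualities and alignment artefacts, which is why practical callers use probabilistic frameworks instead.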
However, alignment artefacts due to indels were found to affect the accuracy of the variant calling results [87]. Thus, separate pre-processing tools such as GATK IndelRealigner were introduced to perform local re-alignment in the affected regions [88] before executing the variant caller. Modern variant calling tools such as GATK HaplotypeCaller and Platypus have a built-in local de novo assembly step to address the aforementioned issue, making GATK IndelRealigner redundant. In local de novo assembly, the genome is broken into small regions and de novo assembly is performed separately in these regions. For local de novo assembly, Platypus uses a variant of de Bruijn graphs called coloured de Bruijn graphs [89], while GATK HaplotypeCaller also uses a de Bruijn-like graph [90].

In past variant callers (e.g., GATK UnifiedGenotyper), each base position on the genome was considered independently when calculating probabilities. However, recent variant callers such as GATK HaplotypeCaller and Platypus break the genome into overlapping haplotypes (a haplotype is a group of variants that tend to occur together) based on initially identified variations. They perform the probability calculation on these haplotypes by mapping reads to each haplotype. GATK HaplotypeCaller uses a pair Hidden Markov Model (pairHMM) [91] and Platypus uses Needleman-Wunsch for mapping reads to haplotypes. Haplotype-based approaches have increased the accuracy of variant calls [92].

Sandmann et al. [79] evaluated the accuracy of eight variant calling tools including GATK, Platypus and SAMtools. None of the variant callers could detect all the variants in their data sets. They also observed that increased sensitivity decreases precision. Further, the accuracy
Hence, modern variant calling tools are being frequently updated to gradually improve accuracy. Variant calling is a time-consuming step that takes hours on a high-performance computer. Despite this, many variant callers such as VarScan, FreeBayes, SNVer [85] and VarDict [93] do not support multi-threading. GATK HaplotypeCaller does support multi-threading. However, multi-threaded executions of GATK HaplotypeCaller frequently crash and thus are not recommended, as stated in the manual [94]. Even during instances that do not crash, the multi-threaded execution of GATK HaplotypeCaller only marginally improves the run-time due to inefficient multi-core utilisation. Further, multi-threaded execution could not reproduce the same result as single-threaded execution, as observed by Sandmann et al. [79]. The Platypus variant caller is capable of efficiently utilising multiple CPU cores through its in-built multi-processing. Chapter 3 describes memory optimisation algorithms associated with variant calling. Specific details of the underlying algorithms are discussed in the background of that chapter.

Read length: In a second-generation sequencing dataset, the lengths of all the reads in the dataset are typically the same (at least for Illumina sequencing, which dominates the second-generation sequencing market). The read length can be initially configured to a particular value between around 50 and 500 bases at the start of a sequencing run (depending on the sequencing machine) and all the reads generated from that sequencing run would be of that configured length.
Error rate:
An example demonstrating the error rate of second-generation sequencing is in Fig. 2.17. This example uses Illumina short-reads from a real dataset (the NA12878 dataset from the 1000 Genomes Project) aligned to a reference (the human genome). Fig. 2.17 is a screenshot of a ∼6 kbases region (chr22:27,103,514-27,109,534) visualised in IGV.

Data size:
The human genome is 3.1 Gbases and the FASTA file (uncompressed) is around 3.1 GB. If the human genome is sequenced at an average coverage of 30X, the yield is around 96 Gbases. If the read length is assumed to be 100, the dataset would contain around 960 million reads. A FASTQ file (uncompressed) storing such a dataset is around 200-250 GB. The generated result from the alignment step stored in a SAM file (uncompressed) is around 250-300 GB. The sorted alignments stored as a BAM file (BGZF compressed) are around 30-40 GB. The VCF file generated from the variant calling step is around 1 GB.
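The arithmetic behind these dataset-size estimates can be reproduced with a back-of-envelope sketch. The ∼50 bytes of per-read FASTQ overhead (header and separator lines) is an illustrative assumption, and the small difference from the 96 Gbases figure above comes from rounding of the genome size:

```python
# Back-of-envelope estimate of second-generation dataset sizes,
# following the arithmetic in the text.
GENOME_BASES = 3.1e9   # human genome, ~3.1 Gbases
COVERAGE = 30          # 30X average coverage
READ_LENGTH = 100      # assumed Illumina read length

yield_bases = GENOME_BASES * COVERAGE              # ~93 Gbases
num_reads = yield_bases / READ_LENGTH              # ~930 million reads
# each FASTQ record: sequence line + quality line + ~50 bytes overhead
fastq_bytes = num_reads * (2 * READ_LENGTH + 50)

print(f"yield: {yield_bases / 1e9:.0f} Gbases")
print(f"reads: {num_reads / 1e6:.0f} million")
print(f"FASTQ: {fastq_bytes / 1e9:.0f} GB")
```

The resulting ∼230 GB FASTQ estimate falls inside the 200-250 GB range quoted above.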
Third-generation sequencing technology is currently under active development and no standard or best-practice workflow exists at present (as opposed to the second generation). Third-generation sequencing workflows are not stable and are constantly evolving. Fig. 2.18 shows the typical workflow for nanopore data processing at the time of writing. The reads (in FASTQ file format) are first aligned to the reference genome (step one in Fig. 2.18). The alignment is conceptually similar to that of the second-generation workflow. However, the software tools used for aligning third-generation sequencing data have distinct characteristics which are different from the previous aligners and are detailed in section 2.2.2.1. After the alignment step, the aligned reads are sorted (step two in Fig. 2.18). The sorting step is identical to that of the second-generation workflow and the most popular sorting tool remains Samtools. (NA12878, mentioned earlier, is a well-studied human genome sample from a particular Utah woman.) The next step (step three, labelled as polishing in Fig. 2.18) can be either variant calling or detection of epigenetic base modifications (e.g., methylation calling). Variant calling or detection of epigenetic base modifications is a challenging process where true variants and/or base modifications must be identified amongst highly erroneous reads (currently 5%-10% error rate). Thus, this step typically uses the raw signals (raw sensor output from the sequencer) in addition to the base-called reads. Associated software tools for variant calling and detection of epigenetic base modifications are detailed in section 2.2.2.2. As stated in section 2.1.2.3, a raw signal is the ionic current measurement when a DNA strand passes through a protein nanopore. Nanopore sequencers output these raw signals in a file format called fast5. The fast5 format is essentially the Hierarchical Data Format 5 (HDF5) [95], with a specific schema determined by ONT to store raw signal data and metadata. Before 2018, a single raw signal (corresponding to a single read) was stored as a single fast5 file, which is currently referred to as a single-fast5 file. However, millions of files generated from a sequencing run were difficult to manage and now a fast5 file contains a batch of raw signals (by default 4000 reads). Such fast5 files containing multiple reads are called multi-fast5 files. HDF5 is a versatile file format with numerous features (including compression). However, HDF5 is a very complicated file format with a monolithic design and a lengthy specification. Consequently, HDF5 files must be accessed through the only existing library, provided by the HDF5 group, which has limitations such as the lack of efficient multi-threaded access.
Chapter 7 of this thesis explores the impact of this limitation on efficient raw signal access and presents alternate solutions to circumvent the limitation. As nanopore sequencers output raw signals, base-calling can optionally be performed externally on a general-purpose computer. However, the latest nanopore sequencers either come with an internal compute module or support an externally attachable dedicated base-calling module running ONT's proprietary base-callers. Thus, base-calling will not be considered under sequence analysis in this thesis.
When examined from a higher level, long-read aligners also use a two-step approach similar to previous aligners: finding potential mapping locations using an index; and applying accurate dynamic programming algorithms to obtain the optimal alignment. However, when examined microscopically, long-read aligners have major differences in underlying algorithms and parameters to handle the distinct characteristics of long-reads. Numerous long-read aligners have been published over the last decade, for instance, BWA MEM [72], BLASR [96], GraphMap [97], Kart [98], NGMLR [99], LAMSA [100] and Minimap2 [101]. Note that BWA MEM is an extended version of the previously discussed BWA (BWA was initially designed for short-reads) that supports long-reads up to a certain degree. Minimap2 [101] is the most popular long-read aligner amongst all the other long-read aligners, due to its superior performance, accuracy and robustness. Minimap2 [101] employs a hash table-based genome index to quickly locate potential mappings and is both fast and accurate compared to the FM-index-based approach in BWA MEM [72]. However, the RAM requirement is higher for a hash table-based index compared to an FM-index. For instance, the hash table data structure itself consumes about 8 GB of memory (RAM) in
Minimap2. The typical RAM consumption of
Minimap2 is around 12 GB on average when memory is allocated for internal data structures (i.e., dynamic programming tables). However, the peak RAM for the human genome can occasionally exceed 16 GB depending on the characteristics of the data, such as the length of the reads. Chapter 4 of this thesis focuses on memory optimisations to
Minimap2 to reduce peak RAM. Banded versions of dynamic programming algorithms such as SW and NW (Fig. 2.15) used for short-reads are not directly suitable for long-reads. In contrast to short-reads, the long-reads which emanate from Nanopore, PacBio, etc., have lengths which are 10 to 10,000 times larger than short-reads, are noisier (with a greater number of errors) and are typically not suitable for such small static bands. The 10% base-calling error rate would result in the alignment significantly deviating from the diagonal (the diagonal mentioned in section 2.2.1.1). A major advantage of long-reads is the detection of long indels (insertions and deletions occasionally spanning lengths longer than short-reads themselves). When aligning such reads, the alignment path deviates significantly from the diagonal. The high error rates and the large indels require the bands to be of large width if they are to be static. The high band-width requirement causes processing times to be extremely high when aligning millions of reads. To improve the speed of this processing, the Suzuki-Kasahara (SK) heuristic algorithm [102] was introduced in 2017. SK utilises an adaptive band scheme, letting a smaller band contain such an alignment within the band, which is exemplified below. Consider the same example in Fig. 2.15b (performed previously with a static band of size 4), now performed with a band-width of only size 3, as shown in Fig. 2.19a. Observe that the band is no longer sufficient to contain the whole alignment, i.e., the cell which previously contained the maximum score is no longer computed; thus, the trace-back would begin from the maximum value within the band, which leads to a non-optimal alignment. This is remedied using an adaptive band in Fig. 2.19b. The band moves either down or to the right (the band dynamically adapts) as determined by the Suzuki-Kasahara heuristic, which is illustrated by blue arrows.
Observe how the alignment can be contained inside a band of width 3, which was previously infeasible using a static band.
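The benefit of an adaptive band can be sketched with a toy banded edit-distance computation. The band-shift rule below (shift right when the best score sits at the band's right edge) is a deliberately simplified stand-in for the actual Suzuki-Kasahara heuristic, which operates on anti-diagonals of an affine-gap score matrix:

```python
INF = float("inf")

def edit_distance(a, b):
    # full dynamic programming matrix, for reference
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def adaptive_banded_edit_distance(a, b, w=3):
    # only w consecutive columns are computed per row; the band shifts
    # right when the best cell lies at its right edge (simplified rule)
    off = 0
    prev = {j: j for j in range(min(w, len(b) + 1))}   # row 0: insertions
    for i, ca in enumerate(a, 1):
        cur = {}
        for j in range(off, min(off + w, len(b) + 1)):
            cands = [prev.get(j, INF) + 1, cur.get(j - 1, INF) + 1]
            if j >= 1:
                cands.append(prev.get(j - 1, INF) + (ca != b[j - 1]))
            cur[j] = min(cands)
        best = min(cur.values())
        rightmost_best = max(j for j, v in cur.items() if v == best)
        if rightmost_best == off + w - 1:   # band drifts right
            off += 1
        prev = cur
    return prev.get(len(b), INF)
```

A static band of the same width keeps `off` fixed and misses the optimal path once the alignment drifts off the initial diagonal, mirroring the difference between Fig. 2.19a and Fig. 2.19b.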
As stated before, variant calling or detection of epigenetic base modifications is a downstream processing step which utilises both base-space alignments and raw signals. This step reuses the raw signals to recover the biological information lost during base-calling. Previous research [56, 103] has shown that the identification of genetic variants can be improved to an accuracy of more than 99% by using raw signal data from multiple overlapping nanopore reads. It has also been shown that methylated C bases can be differentiated from non-methylated C bases by the use of signal data, using algorithms such as the one implemented in the software package
Nanopolish [104]. Thus, the downstream analysis that reuses raw signal data could also detect modified nucleotide bases. At the time of writing,
Nanopolish [104] is the most popular software package amongst the nanopore community for variant calling and detection of epigenetic base modifications.
Nanopolish takes the reads, their alignments to the reference genome and the raw signal of each read as the input. Initially, the raw signal is segmented in the time domain based on sudden jumps in the signal and these segments are known as events. The events are then aligned to a hypothetical signal model using an algorithm called
Adaptive Banded Event Alignment (ABEA). The output of ABEA and the alignment details of reads to the reference genome are sent through a Hidden Markov Model (HMM) to detect variants or base modifications. Nanopolish is written in C/C++ and supports multi-threading through OpenMP. Chapter 5 of this thesis is about the optimisation of
Nanopolish (the ABEA algorithm in particular) and details of the algorithm are discussed in the background of that chapter. Tombo is another software tool for the detection of modified bases which also uses raw signals. Tombo has been developed in the Python programming language. Recently, a few neural network-based variant callers have also been released, for instance, Medaka [105], Clairvoyante [106] and Clair [107] (the successor of Clairvoyante). These neural network-based variant callers have been developed in the Python programming language and use TensorFlow in the backend. Unlike
Nanopolish, these neural network-based variant callers rely only on base-called reads and are incapable of using raw signal data.

Read length: In third-generation sequencing, read lengths can vary significantly within a dataset. For instance, read lengths can range from a few hundred bases to >1 Mbases in nanopore sequencing. Fig. 2.20 shows the read length distributions of nine datasets, out of the 53 publicly available NA12878 datasets (different datasets produced at different sequencing runs of the NA12878 sample) from the nanopore consortium [56]. The library preparation method is a major factor that affects the read lengths. Currently, there are three library preparation methods for nanopore: ligation and rapid, which are officially from Oxford Nanopore, and
ultra, which is community-developed [56]. Fig. 2.20 shows three datasets from each of those library preparation methods and demonstrates that the read length distributions vary not only among different library preparation methods but also among different datasets from the same library preparation method.
Error rate:
The error rate of nanopore third-generation sequencing is demonstrated in Fig. 2.21 using reads from a real dataset (all NA12878 data from the nanopore consortium [56]) aligned to a reference (the human genome). Fig. 2.21 is a screenshot from IGV of a region from this dataset.

Data size:
When all NA12878 datasets from the nanopore consortium are aggregated, the total is around 132.931 Gbases. This is equivalent to about 40X coverage of the human genome. The number of reads is 15.667 million. The
FASTQ (uncompressed) file containing these reads is 250 GB in size. The SAM file (uncompressed) generated by aligning these reads using
Minimap2 is 280 GB. The sorted BAM file (compressed) is around 150 GB. Per-read methylation calls generated from Nanopolish, which are in TSV format (uncompressed), consume around 70 GB. The VCF file generated from the variant calling step is around 1 GB. Raw signals corresponding to the aforementioned reads stored in the latest fast5 format (multi-fast5 with compression) consume around 2.2 TB. Note that this used to be 46 TB a few years ago when stored as single-fast5 files (mostly due to redundant data such as the event table).
If it is the first time that the DNA of a particular species is sequenced, then the reads must be assembled without any reference, using only the information in the reads. This process is known as de novo assembly. Early de novo assemblers such as SEQAID [108] were based on greedy algorithms. Modern assemblers rely on graph-based techniques. Short-read assemblers mostly use de Bruijn graphs [109]; examples that use de Bruijn graphs are ABySS [110], Velvet [109], Spades [111] and Cortex [112]. However, SGA [113], which is also a short-read assembler, uses overlap graphs (a type of overlap graph called string graphs [114]). Almost all long-read assemblers use overlap graph-based methods. Examples of long-read assemblers are miniasm [115], flye [116], canu [117] and wtdbg2 [118]. Currently, there are three de novo assembly workflows: (1) using only short-reads; (2) using only long-reads; and (3) using both, called hybrid assembly. For more information on de novo assembly, readers may refer to [29].
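The de Bruijn graph construction that underlies these assemblers can be sketched minimally: reads are decomposed into k-mers, and each (k-1)-mer overlap becomes an edge. This toy version ignores the error correction, coverage thresholds and graph cleaning that real assemblers perform:

```python
from collections import defaultdict

def build_debruijn(reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, and each k-mer in a
    read adds an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Two overlapping reads drawn from the sequence "ACGTAC":
g = build_debruijn(["ACGTA", "CGTAC"], k=4)
```

Walking the resulting edges (ACG → CGT → GTA → TAC) reconstructs the original sequence; assemblers recover contigs by finding such paths through the graph.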
Studies in the field of DNA analysis have predominantly focused on improving the accuracy of algorithms. Such improvements have further increased the computing power required for the analysis. Compared to the plethora of studies focusing on accuracy, studies attempting to reduce the gap between DNA sequencing and analysis technologies (architecture-aware optimisation of sequence analysis algorithms) are minimal. This section presents those architecture-aware optimisation studies under four categories: work that optimises sequence analysis algorithms for general-purpose CPUs, HPC, cloud computing and distributed computing in section 2.3.1; work on GPU-based optimisations in section 2.3.2; FPGA-based optimisations in section 2.3.3; and specialised hardware designs for sequence analysis in section 2.3.4.
Core sequence alignment algorithms for second-generation sequencing such as SW and NW (discussed in section 2.2.1.1) have been optimised to efficiently utilise Single-Instruction Multiple-Data (SIMD) instructions in modern Intel CPUs (SSE and AVX). Libraries such as libssa [119], Parasail [120], SeqAn [121], SSW [122] and SWPS3 [123] are some examples of such SIMD-based optimisations. SK, the core alignment algorithm for third-generation sequencing (discussed in section 2.2.2.1), has also been accelerated using SIMD instructions in a library called libgaba [102]. The most popular second-generation read alignment tool, BWA MEM (discussed in section 2.2.1.1), has very recently been optimised for better cache, memory and SIMD instruction utilisation by researchers from the parallel computing lab of Intel, yielding a 2.4X improvement in performance. This work has been released as open-source software named BWA MEM 2 [124]. The GATK best practices pipeline for second-generation sequencing (discussed in section 2.2.1) has been optimised for HPC environments in [125] to efficiently utilise the available multiple cores and bandwidth. Another work called ADAM, which is a library and a command-line tool, enables the use of Apache Spark to efficiently parallelise genomic data analysis across cluster/cloud computing environments [126, 127]. Attempts to utilise cloud computing for DNA analysis have been made [128-130]. However, transferring DNA data which are hundreds of gigabytes in size over the Internet is not efficient, as the data transfer itself may consume more time than the analysis. Additionally, uploading sensitive DNA information raises privacy concerns [131].
The suitability of massively parallel Graphics Processing Units (GPUs) for DNA sequence analysis has been investigated. The core alignment algorithm, SW, has been accelerated using GPUs in examples such as [132-134]. GPU-accelerated aligners such as SOAP3 [135], BarraCUDA [136] and MUMmerGPU [137] have been released for second-generation sequencing. GPU-accelerated variant calling tools for second-generation sequencing such as BALSA [138] are also available for use. GPU-acceleration efforts have been made for third-generation sequencing as well. The nanopore base-calling software known as
Guppy exploits NVIDIA GPUs for fast processing [55].
Guppy is proprietary software provided by ONT that uses deep neural networks. Design details of
Guppy are not known due to the program being closed source.
Guppy has likely benefited from the plethora of work focusing on GPU optimisations in the neural network domain. Minimap2, the popular open-source base-space aligner for long-reads (discussed in section 2.2.2.1), has recently been accelerated with the simultaneous use of a GPU and an Intel Xeon Phi co-processor [139]. However, the source code for this accelerated Minimap2 is not openly available. Recently, the NVIDIA corporation has shown an interest in developing open-source libraries such as
Clara Genomics [140] for accelerating long-read data analysis on their GPUs.
The Clara Genomics library contributes to the nanopore data analysis domain through the acceleration of core algorithmic components, such as all-vs-all read mapping and partial order alignment, for performing de novo assembly.
The utility of Field Programmable Gate Arrays (FPGAs) for accelerating key computational kernels in second-generation sequence analysis has been explored by researchers. The SW alignment algorithm has been accelerated in work such as [141, 142]. Edit distance-based alignment has been accelerated by 40-60 times in [143]. Pair-HMM alignment, a major bottleneck in the GATK HaplotypeCaller, has been accelerated in studies such as [144] (487 times performance improvement) and [145] (14.85 times throughput improvement). The reported speedups for FPGA-accelerated key algorithms are impressive. However, the overall speedup when such components are integrated into an end-to-end analysis workflow is yet to be explored. FPGA-based commercial accelerators also exist for second-generation sequence analysis. Examples are DeCypher [146] and Dragen [147]. DeCypher [146] is from a company called Timelogic. Their proprietary FPGA cards are fixed to servers using the Peripheral Component Interconnect (PCI) Express interface. Alignment algorithms such as BLAST, SW and HMM are supported on these cards. Timelogic claims that their
FPGA card is equivalent to 860 generic CPU cores [146]. Dragen is from a company called Edico Genome (recently acquired by the sequencing giant Illumina). Edico Genome claims that the whole genome analysis pipeline, including sequence mapping and variant calling, can be completed within 22 minutes [148]. A major drawback of these commercial systems is that they are proprietary, and the users are restricted to the few algorithms provided by the company. These commercial systems are also prohibitively expensive when compared to purchasing general-purpose servers. To date, the use of FPGAs for rapidly evolving third-generation sequence analysis has not been explored. Traditional implementations for FPGAs that are done using Hardware Description Languages (HDL) are not very flexible and a slight algorithmic change requires considerable modification of the implementation. In the future, when third-generation sequence analysis algorithms are relatively stable, FPGA-based acceleration of such algorithms is anticipated. We believe that ongoing advancements in high-level synthesis (HLS) will further increase FPGA-based acceleration efforts in the future. HLS attempts to improve flexibility while achieving performance similar to hand-optimised HDL. The OpenCL framework is increasingly becoming popular for FPGA acceleration. Preliminary attempts that use the OpenCL framework for accelerating genomic kernels have been made in work such as [149, 150].
There have been rare attempts to design custom hardware for sequence analysis. MESGA [151] is a multiprocessor system-on-chip (MPSoC) architecture based on embedded processors for short-read alignment. DARWIN [152] is a co-processor for long-read alignment. Large speedups have been reported for these custom hardware designs, as anticipated. However, these systems have been evaluated only using simulations, potentially due to the impractical fabrication cost unless mass-produced. Though custom hardware can provide extremely fast performance with a smaller footprint, the design flow is complex and the non-recurring engineering (NRE) cost is very high. DNA analysis algorithms improve rapidly and new algorithms are frequently introduced, especially for new sequencing technologies that are ever-improving. Thus, custom hardware for genomics processing is unlikely to become mainstream in the near future. In 1997, a company named Paracel introduced GeneMatcher, a specialised genome analyser based on Application-Specific Integrated Circuits (ASIC) [153]. The second version, GeneMatcher 2, was equipped with more than 27,000 processors. Unfortunately, the ASIC-based GeneMatcher systems did not thrive.

Figure 2.15: Dynamic programming-based sequence alignment; (a) optimal sequence alignment, (b) banded sequence alignment (band-width = 4).
Figure 2.16: Simplified elaboration of variant calling (reads sorted and aligned against the reference, with the variant marked).

Figure 2.17: A screenshot from IGV for the region chr22:27,103,514-27,109,534 from an NA12878 dataset aligned to the human genome (tracks: variants and read alignments).

Figure 2.18: Simplified third-generation nanopore workflow (reads in a FASTQ file are aligned to the reference genome to produce alignment records in a SAM file, which are sorted and then polished together with the raw signals in fast5 files to produce variants/methylated bases in a VCF/TSV file; base-calling produces the reads from the raw signals).

Figure 2.19: Evolution of dynamic programming-based sequence alignment; (a) banded sequence alignment (band-width = 3), (b) adaptive banded sequence alignment.
Figure 2.20: Read length distributions of nine NA12878 datasets from the nanopore consortium (panels: FAB41174, FAB42473, FAB49908, FAF04090, FAF05869, FAF09968, FAF15586, FAF15665, FAF18554; x-axis: read length, log scale; y-axis: number of reads; colours: ligation, rapid and ultra sample preparation methods).

Figure 2.21: A screenshot from IGV for a region from an NA12878 sample of the nanopore consortium genome project (tracks: variants and read alignments).
Chapter 3
Cache Friendly Optimisation of de Bruijn Graph based Local Re-assembly in Variant Calling
This chapter has been published in IEEE/ACM Transactions on Computational Biology and Bioinformatics. ©2018 IEEE. Reprinted, with permission, from
H. Gamaarachchi, A. Bayat, B. Gaeta, and S. Parameswaran, "Cache Friendly Optimisation of de Bruijn Graph based Local Re-assembly in Variant Calling," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018. DOI: https://doi.org/10.1109/TCBB.2018.2881975 [24].

A variant caller is used to identify variations in an individual genome (compared to the reference genome) in a genome processing pipeline. For the sake of accuracy, modern variant callers perform many local re-assemblies on small regions of the genome using a graph-based algorithm. However, such graph-based data structures are stored inefficiently in the linear memory of modern computers, which in turn reduces computing efficiency. Therefore, variant calling can take several CPU hours for a typical human genome. We have sped up the local re-assembly algorithm, with no impact on its accuracy, by the effective use of the memory hierarchy. The proposed algorithm maximises data locality so that the fast internal processor memory (cache) is used efficiently. Through the increased use of caches, accesses to main memory are minimised. The resulting algorithm is up to twice as fast as the original one when executed on a commodity computer and could gain even more speedup on computers with less complex memory subsystems.
The capability of Next Generation Sequencing (NGS) technology has grown faster than Moore's law in the past few years [40]. Both the time and the cost of sequencing a genome have dropped to an affordable level and are expected to drop further [87]. However, the capability of computing technology has not kept up with the pace of improvement in sequencing technology [154]. Hence, it is an increasing challenge for computers to process such massive amounts of data. The most commonly used NGS technologies produce a large number of short DNA fragments known as reads. The reads are aligned to a reference genome using sequence aligners such as BWA [70] and Bowtie [68]. After sequence alignment, a process called variant calling [155] identifies the actual genomic variations amongst the artefacts generated by sequencing machines (sequencing errors). Traditional variant callers such as GATK UnifiedGenotyper [80] fully rely on the alignment produced by sequence aligners. Sequence aligners only perform pairwise alignment between an individual read and the reference genome. However, a more accurate alignment can be obtained by considering all the reads that are aligned to a particular region of the genome. Tools such as GATK IndelRealigner [88] were designed to improve the accuracy of alignments using information from all the reads mapped to a region. Such tools are run prior to the use of a traditional variant calling algorithm. Modern variant callers, such as GATK HaplotypeCaller [80], Platypus [84], SOAPindel [156] and Scalpel [157], take a different approach compared to traditional variant callers. These modern variant callers utilise de Bruijn graph-based de novo local re-assembly, which assembles the genome locally (only a small region) using all the reads mapped to that region. The local re-assembly results in greater accuracy in variant identification [84].
The typical workflow of a modern variant caller is given below (though there are slight differences in each variant calling tool, the basic workflow remains the same).

• Local re-assembly: A de Bruijn graph [158] is formed using the reads aligned to a particular region of the genome. The graph is then traversed to detect candidate variant sites. This graph construction and graph traversal are performed for one region at a time [84].

• Aligning reads to haplotypes: Haplotypes are formed using the candidate variant sites identified through local re-assembly. Then, the reads in the corresponding region are aligned to each haplotype using a pairwise alignment algorithm such as Needleman-Wunsch [74] or pair-HMM [91].

• Finding statistically significant variants: Statistical approaches are applied to find the most probable variants using the reads aligned to the haplotypes.

Thus far, a number of studies have been performed to improve the variant calling process. However, most research on variant calling has been restricted to improving accuracy. Very little attention has been paid to optimising core components of modern variant calling algorithms, such as de Bruijn graph-based local re-assembly. Modern variant callers are compute- and memory-intensive compared to traditional variant callers. The main reason for this additional computation is the local re-assembly step. This is illustrated in Fig. 3.1, which shows the time requirement for different steps of variant calling using the Platypus variant caller (refer to Section 3.5 for information on the datasets and the configuration of the machine where the experiment was performed). De Bruijn graph-based assembly involves two major steps: graph construction and graph traversal. For the three datasets in Fig. 3.1, graph construction took around 66% of the total time. Graph traversal took less than 0.5% of the total time. All other tasks, including reads-to-haplotype alignment, probability computation and disk accesses, added up to around 33% of the total time. This experiment suggests that improving the performance of graph construction will result in a significant increase in overall performance. Graph construction is time-consuming since it requires random accesses to memory, which reduce the locality of memory accesses. When there is locality in memory accesses, most accesses can be handled by the fast internal processor memory (cache). If data exists in the cache (a cache hit) it can be accessed quickly.
If data does not exist in the cache (a cache miss) it has to be loaded from main memory (RAM), which usually takes a much longer time. In this paper, we introduce several improvements to the original de Bruijn graph-based local re-assembly algorithm such that locality in memory accesses is increased and the processor cache is utilised efficiently. One of the most effective improvements we apply is to exploit existing alignment information in the graph construction process. Our proposed improvements result in the algorithm being about twice as fast as the original algorithm. In order to test the proposed algorithm, we ported it into the Platypus variant caller. The modified Platypus implementation is available at [159]. The rest of the paper is organised as follows. Section 3.2 discusses related work. Then, in Section 3.3, we explain the de Bruijn graph-based local assembly algorithm. In Section 3.4, we present our optimisation techniques. Next, in Section 3.5, we present the experimental setup and results. After that, Section 3.6 elaborates on future directions. Finally, we conclude in Section 3.7.
Figure 3.1: Distribution of execution time for the Platypus variant caller (legend: local re-assembly graph construction; local re-assembly graph traversal; other)
Modern variant callers such as GATK HaplotypeCaller [80] and Platypus [84] consist of local re-assembly, read-to-haplotype alignment and variant identification. In GATK HaplotypeCaller, aligning reads to the haplotypes (using the Pair-HMM algorithm) takes much of the processing time [144]. However, through the use of Intel's Advanced Vector Extensions (AVX) instructions and FPGAs, this processing time can be considerably reduced (by 720X using Intel AVX and 3,857X using an FPGA [160]). The latest versions of GATK HaplotypeCaller support Intel AVX and FPGA [161] (the latter as an experimental feature). Additionally, Sentieon Inc. has optimised GATK without specialised hardware; however, their product is commercial. Furthermore, tools such as Avocado [162] employ algorithms with lower time complexity for aligning reads to haplotypes. Platypus is a newer variant caller which is faster, with better indel accuracy, than the widely used GATK HaplotypeCaller [84, 163]. Platypus aligns reads to haplotypes using a Single Instruction Multiple Data (SIMD) accelerated Needleman-Wunsch algorithm. Hence, the alignment phase is already fast. However, as discussed above, more than 60% of the runtime of Platypus is spent on de Bruijn graph-based local re-assembly.

De Bruijn graphs were first used for de novo assembly, where assembly is performed solely from the reads when a reference genome is unavailable. Several researchers have proposed techniques for de Bruijn graph optimisation for de novo assembly, where the inputs to the assembler are unaligned reads. However, in the case of aligned reads for local re-assembly, the utility of the optimisations used for de novo assembly is limited. The graph for local re-assembly is much smaller: only a few megabytes, compared to several gigabytes for de novo assembly.
This is because the whole genome is considered at once for de novo assembly, whereas local assembly is performed on small regions (for instance, a region is 1500 bases in Platypus). Consequently, optimisation techniques for de novo assembly have focused on processing large graphs through memory-size optimisation or parallelisation. Work such as Cortex [112], SOAPdenovo2 [164] and [165] have compressed the graph, while ABySS [110] and [166] have parallelised the graph across clusters. However, as the graph is small for local re-assembly, techniques designed for large graphs are superfluous.

In summary, the present bottleneck in modern variant callers is the local re-assembly process. To the best of our knowledge, no one has focused on optimising de Bruijn graph-based local re-assembly by utilising the information from aligned reads for superior cache usage. Note that the indel accuracy of GATK HaplotypeCaller is higher than that of Platypus today; however, this was not the case at the time the manuscript was published.
Figure 3.2: A region of the reference and a few mapped reads (positions i = 0 to 13; shading marks a probable SNV, read errors and an indel)
In local re-assembly, the reference and the reads are used to construct a de Bruijn graph. Typically, a region of several thousand bases is considered at a time. The reference and the reads are broken into k-mers, such that adjacent k-mers overlap by k-1 bases (see the example in Fig. 3.2 and Fig. 3.3). Each unique k-mer forms a node in the graph, and the edges of the graph link adjacent k-mers.

This section gives a brief account of a typical de Bruijn graph construction algorithm for local re-assembly. The example region of the reference and the reads mapped to that region in Fig. 3.2, together with its de Bruijn graph in Fig. 3.3, will be used throughout the explanation. In this simplified example, the read size and the k-mer size are 6 and 3 respectively (this is for demonstration purposes only; in reality, they are around 100 and 15 respectively).
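The k-mer decomposition described above (adjacent k-mers overlapping by k-1 bases) amounts to a sliding window; a minimal sketch:

```python
def kmerize(seq, k):
    """Break a sequence into overlapping k-mers; adjacent k-mers
    share k-1 bases, as done for the reference and each read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# the first read from the worked example (read length 6, k = 3)
print(kmerize("ACAGAA", 3))   # ['ACA', 'CAG', 'AGA', 'GAA']
```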
Figure 3.3: De Bruijn graph for the region

Bases corresponding to a probable single nucleotide variant (SNV), read errors and an indel are shaded as per the legend in Fig. 3.2. The de Bruijn graph in Fig. 3.3 is constructed from all the k-mers in the region in Fig. 3.2. Nodes can either be shared by both the reference and the reads or belong to only one of them, as shown in Fig. 3.3. In addition, note that the node for a given k-mer is unique, regardless of the number of occurrences of that k-mer in the reference or the reads. The graph is typically stored in computer memory using dynamically allocated nodes, where memory pointers represent the edges.

For performance reasons, a hash table data structure is required during graph construction so that repeated accesses to the same node can locate that node quickly. As an example, consider the 'CAG' k-mer in Fig. 3.3, which is shared by both the reference and the reads. Fig. 3.2 shows that 'CAG' is repeated twice in the reference and five times in the reads. During graph construction, a node will be created at the first occurrence, but it must be found within the already-constructed graph for each subsequent occurrence. Instead of exhaustively searching all nodes, the hash table is used for fast lookup. A hash table for the example in Fig. 3.3 is shown in Fig. 3.4. The index of the hash table for a particular k-mer is found by applying a hash
Figure 3.4: The hash table

function to the k-mer. The simple hash function used in this example is the sum of the ASCII values of the characters in the k-mer, modulo the number of hash table entries. For instance, there are 8 entries in the hash table, and the index for the k-mer 'ACA' is ('A'+'C'+'A') % 8, which is 5. Hence, the address of the node containing 'ACA' is at index 5 in Fig. 3.4, denoted by '*ACA'.

Algorithm 1 and Algorithm 2 show how the graph is constructed. Note that these algorithms are not full implementations, but the components necessary to explain our methodology. Algorithm 1 shows how the reference is loaded into the de Bruijn graph. Algorithm 2 illustrates how reads are then loaded into the same graph.

In Algorithm 1, reference is an array that contains the reference genome. The algorithm iterates through the reference while adding edges between adjacent k-mers. For the example in Fig. 3.2 and Fig. 3.3, the algorithm adds edges in the order ACA-CAG, CAG-AGA, AGA-GAA, and so on.
Algorithm 1
Load the reference into the de Bruijn graph

    function loadReference(reference)
        for i = 0 : (region_size - kmer_size) do
            kmer1 ← reference[i : i+kmer_size]
            kmer2 ← reference[i+1 : i+1+kmer_size]
            addEdge(kmer1, kmer2)
        end for
    end function

In Algorithm 2, readsInWindow is a buffer consisting of all the mapped reads in the region, sorted by mapped position. For each read, edges are added between adjacent k-mers. In our example, ACAGAA is the first read. Edges are added in the order ACA-CAG, CAG-AGA, AGA-GAA for this read. Similarly, all other reads are processed.
Algorithm 2
Load reads into the de Bruijn graph

    function loadReads(readsInWindow)
        for read in readsInWindow do
            for i = 0 : (read_length - kmer_size) do
                kmer1 ← read[i : i+kmer_size]
                kmer2 ← read[i+1 : i+1+kmer_size]
                addEdge(kmer1, kmer2)
            end for
        end for
    end function

Algorithm 3 shows the implementation of the addEdge function used in Algorithms 1 and 2. The hashTableLookUpOrInsert function in Algorithm 3 locates the corresponding node for a given k-mer using the hash table. If no memory pointer exists in the hash table for a node containing the k-mer, then memory is allocated for a new node and the hash table is updated. Finally, hashTableLookUpOrInsert returns the memory pointer to the node. The addEdge function calls hashTableLookUpOrInsert on both kmer1 and kmer2. Finally, createLink adds the connection between the nodes by storing the pointer to the second node (ptr2) in the first node.

Algorithm 3
Add an edge connecting kmer1 and kmer2

    function addEdge(kmer1, kmer2)
        ptr1 ← hashTableLookUpOrInsert(kmer1)
        ptr2 ← hashTableLookUpOrInsert(kmer2)
        createLink(ptr1, ptr2)
    end function

In this section, we show how the algorithms in Section 3.3 are modified to minimise accesses to the RAM. First, we give a simplified overview of our methodology. Then, we present additional technical information so that our method can be replicated in any de Bruijn graph-based variant caller.
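Algorithms 1-3 above can be prototyped in a few lines of Python, including the toy hash function from the worked example (ASCII sum modulo the table size, with chaining for collisions). This is a simplified model, not the Platypus implementation: real nodes are dynamically allocated structures that also carry edge weights.

```python
class Node:
    def __init__(self, kmer):
        self.kmer = kmer
        self.edges = []      # references (pointers) to adjacent nodes

TABLE_SIZE = 8
hash_table = [[] for _ in range(TABLE_SIZE)]   # chained buckets

def toy_hash(kmer):
    # hash function from the worked example: ASCII sum mod table size
    return sum(ord(c) for c in kmer) % TABLE_SIZE

def hash_table_lookup_or_insert(kmer):
    """hashTableLookUpOrInsert: locate the node for a k-mer,
    allocating a new node on its first occurrence."""
    bucket = hash_table[toy_hash(kmer)]
    for node in bucket:
        if node.kmer == kmer:
            return node
    node = Node(kmer)
    bucket.append(node)
    return node

def add_edge(kmer1, kmer2):           # Algorithm 3
    ptr1 = hash_table_lookup_or_insert(kmer1)
    ptr2 = hash_table_lookup_or_insert(kmer2)
    if ptr2 not in ptr1.edges:        # createLink
        ptr1.edges.append(ptr2)

def load_sequence(seq, k=3):          # loop body shared by Algorithms 1 and 2
    for i in range(len(seq) - k):
        add_edge(seq[i:i + k], seq[i + 1:i + 1 + k])

load_sequence("ACAGAA")               # first read of the worked example
print(toy_hash("ACA"))                # (65 + 67 + 65) % 8 = 5
```

Note that 'ACA' and 'AAC' are anagrams and therefore hash to the same bucket (index 5), consistent with the shared slot visible in Fig. 3.4; chaining resolves the collision.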
The hash table produced during graph construction (explained in Section 3.3) is too large to reside entirely in cache; hence, it must reside in RAM. Accesses to a hash table are random, and therefore cache misses occur very frequently. We propose techniques to minimise these random memory accesses by exploiting the following two factors. (If the edge already exists, the weight parameter of the edge is updated; the details are not discussed, as they are not required to understand the cache behaviour.)

• The input reads to the variant caller are already aligned to the reference - Information from the already-aligned reads can be exploited for efficient utilisation of the memory hierarchy, minimising random accesses to the RAM. For instance, alignment information can be used to predict the majority of memory locations, and such predictions minimise random accesses to the RAM.

•
Variants and sequencing artefacts are rare - In the presence of variants or sequencing artefacts, memory accesses deviate from the expected pattern. For instance, predictions that bypass the hash table can be incorrect, requiring random accesses to the RAM. However, such events are rare, as we show in Section 3.5.

Algorithm 1 is modified such that a cache-friendly array is filled during the construction of k-mers for the reference genome. Then, Algorithm 2 is modified to utilise this filled cache-friendly array, reducing memory accesses to the RAM. Furthermore, Algorithm 3 is introduced with an additional register-friendly variable. This register-friendly variable accelerates the construction of graph edges by eliminating redundant accesses to the cache and the RAM.
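These three modifications can be sketched together in Python ahead of the detailed listings (Algorithms 4-7). This is a simplified model with counters recording where each node lookup is satisfied; the example sequences are hypothetical, and a dict stands in for the explicit hash table:

```python
nodes = {}                      # a dict standing in for the hash table
node_cache = [None] * 64        # nodeCache: reference position -> node
last_access = None              # lastAccess "register"
hits = {"lastAccess": 0, "nodeCache": 0, "hashTable": 0}

def hash_lookup_or_insert(kmer):
    hits["hashTable"] += 1
    return nodes.setdefault(kmer, {"kmer": kmer, "edges": set()})

def look_up(kmer, pos):
    # Algorithm 7: try the position-indexed nodeCache first,
    # fall back to the hash table on a miss
    cached = node_cache[pos]
    if cached is not None and cached["kmer"] == kmer:
        hits["nodeCache"] += 1
        return cached
    return hash_lookup_or_insert(kmer)

def add_edge(kmer1, kmer2, pos1, pos2):
    # Algorithm 6: the start node is usually the end node of the
    # previous edge, so consult the lastAccess register first
    global last_access
    if last_access is not None and last_access["kmer"] == kmer1:
        hits["lastAccess"] += 1
        ptr1 = last_access
    else:
        ptr1 = look_up(kmer1, pos1)
    ptr2 = look_up(kmer2, pos2)
    last_access = ptr2
    ptr1["edges"].add(kmer2)    # createLink
    return ptr1

def load_reference(ref, k=3):   # Algorithm 4: also fills the nodeCache
    for i in range(len(ref) - k):
        node_cache[i] = add_edge(ref[i:i+k], ref[i+1:i+1+k], i, i + 1)

def load_read(read, mapped_pos, k=3):   # Algorithm 5
    for i in range(len(read) - k):
        add_edge(read[i:i+k], read[i+1:i+1+k],
                 mapped_pos + i, mapped_pos + i + 1)

load_reference("ACAGAAAGTCC")   # hypothetical reference-like string
load_read("ACAGAA", 0)          # a read identical to the reference prefix
print(hits)
```

For this read, which matches the reference exactly, not a single hash table access is needed: every start node after the first comes from lastAccess, and every remaining lookup is served by the nodeCache.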
In Fig. 3.2, for position 1 (i=1), CAG is the k-mer in the reference as well as in the first three reads. Similarly, if there were no variants (such as SNVs or indels) or read errors, the k-mer in the reference and in the reads would be the same at each position. However, variants and read errors cause k-mers in the reads to differ from those of the reference. For instance, the k-mers at position 4 for the second to fifth reads are TAC, TAC, TTC and TAG, while the k-mer on the reference at that position is AAC. This difference is due to the SNV at position 4 and the read errors at positions 5 and 6 in the fourth and fifth reads. However, variants and read errors are infrequent. Variation between a particular human genome and the reference
human genome is about 0.5% [12], and the read error rate of modern sequencing machines is about 0.1%-1% [51]. Thus, about 99% of the time, the k-mers in the reads are identical to the k-mers in the reference at a given position. This high probability can be used to predict the memory addresses of nodes, minimising accesses to the hash table. This is accomplished through a pointer array referred to as the nodeCache (see Fig. 3.5), which is populated when loading the reference into the graph.

Figure 3.5: Elaboration of how mapping information is used to improve cache performance (the nodeCache maps reference positions to node pointers; all k-mers of read 1 hit the nodeCache, while two k-mers of read 7 miss due to an indel)

Fig. 3.5 elaborates on how the nodeCache is used. The reference and the two reads in Fig. 3.5 are taken from Fig. 3.2. The nodeCache in Fig. 3.5 has been populated while loading the reference into the graph, by storing the node address that corresponds to each position i. For instance, position 0 on the reference forms the k-mer ACA, and the memory address of the node containing this k-mer is stored at index 0 of the nodeCache. The nodeCache is then utilised when loading the reads into the graph. Consider the read ACAGAA and its k-mers in Fig. 3.5. This read is mapped to position 0 on the reference, and hence the four k-mers ACA, CAG, AGA and GAA map to positions 0, 1, 2 and 3 respectively. When loading these k-mers into the graph, the corresponding location in the nodeCache is looked up, as shown by the arrows in Fig. 3.5. As this read is exactly the same as the reference, all the accesses are hits to the nodeCache. However, for the other read in Fig. 3.5, two misses occur due to the marked indel in the read. As explained earlier, variants and read errors are infrequent and, therefore, misses are few. In the case of a hit to the nodeCache, the node can be accessed directly. The hash table has to be accessed only in the case of a miss to the nodeCache.
If the nodeCache were not used, all these accesses would have to go through the hash table, and thus access memory randomly.

Although the nodeCache is an array originally residing in memory, it is cache friendly due to the spatial and temporal locality of accesses to it. For instance, in Fig. 3.5, observe how the accesses to the nodeCache (marked with arrows) for consecutive k-mers in a read exhibit spatial locality. Temporal locality arises from consecutive reads: in Fig. 3.2, note how consecutive reads overlap each other. Hence, locations in the nodeCache corresponding to one read are re-accessed for the next read, exhibiting temporal locality. For instance, the k-mer CAG, which is accessed for the first read ACAGAA, has to be accessed again when processing the second read CAGTAC. Due to both this spatial and temporal locality, the nodeCache array is cache friendly.

In addition to the above, the end node of an edge is always the start node of the next edge throughout a read. For instance, the edges for the read ACAGAA are added in the order: 1. ACA-CAG; 2. CAG-AGA; and 3. AGA-GAA. Note how CAG and AGA, the end nodes of edges 1 and 2, are the start nodes of edges 2 and 3, which causes repeated accesses to the same memory locations. This pattern is observed for the reference as well. Although these accesses are already cache friendly due to temporal locality, they can be made even faster by using a register (which we refer to as lastAccess). The lastAccess register stores the memory pointer to the end node of the current edge, to be used when loading the start node of the next edge. In an implementation for a general-purpose processor, lastAccess can be a globally declared variable.

The implementation of the method above is given in Algorithm 4 and Algorithm 5. Algorithm 4 shows how the reference is loaded into the de Bruijn graph.
A globally accessible variable lastAccess is used to store the end node of each edge (line 1 of Algorithm 4). A globally accessible array nodeCache is initialised with NULL pointers (line 2 of Algorithm 4). The k-mers are extracted from the reference as before. The addEdge function, called at line 7 of Algorithm 4, now returns the pointer (ptr1) to kmer1, which is then stored in the nodeCache (line 8 of Algorithm 4).
Algorithm 4
Load the reference into the de Bruijn graph

    global lastAccess = NULL
    global nodeCache[region_size] = {NULL}
    function loadReference(reference)
        for i = 0 : (region_size - kmer_size) do
            kmer1 ← reference[i : i+kmer_size]
            kmer2 ← reference[i+1 : i+1+kmer_size]
            ptr1 ← addEdge(kmer1, kmer2, i, i+1)
            nodeCache[i] ← ptr1
        end for
    end function

Algorithm 5 shows how reads are loaded. The function getMappedPosition at line 3 retrieves the position on the reference to which the read is mapped. The positions of the k-mers are then computed and provided as arguments to addEdge (lines 7-9 of Algorithm 5).

Algorithm 6 elaborates the addEdge function. First, kmer1 (the start node of the current edge) is compared with the k-mer in lastAccess at line 2 of Algorithm 6. If identical, there is no need to inspect the
nodeCache or hash table. Otherwise, LookUp will be
Algorithm 5
Load reads into the de Bruijn graph

    function loadReads(readsInWindow)
        for read in readsInWindow do
            readMapping = getMappedPosition(read)
            for i = 0 : (read_length - kmer_size) do
                kmer1 ← read[i : i+kmer_size]
                kmer2 ← read[i+1 : i+1+kmer_size]
                pos1 ← readMapping + i
                pos2 ← readMapping + i + 1
                addEdge(kmer1, kmer2, pos1, pos2)
            end for
        end for
    end function

called, which inspects the nodeCache at line 5 of Algorithm 6. Unlike kmer1, the end node of the edge, kmer2, is directly checked in the
nodeCache by calling LookUp at line 7 of Algorithm 6. Finally, the end node of the current edge is backed up in lastAccess for future use, and createLink is called (lines 8-9 of Algorithm 6). createLink is the same as described previously in Section 3.3.

The function LookUp in Algorithm 7, called at lines 5 and 7 of Algorithm 6, attempts to locate the node for the k-mer in the nodeCache. If it is a hit to the nodeCache, the pointer to the node can be returned immediately (lines 3-4 of Algorithm 7). The hash table is accessed only in the case of a miss (line 6 of Algorithm 7).

Fig. 3.6 summarises how memory accesses are allocated. A memory access to look up the memory address of a node in the graph corresponds either to a start node or an end node of an edge in the graph (as explained previously). If the access is for a start node, the lastAccess register is inspected first, as shown in Fig. 3.6. If a hit occurs when looking up the lastAccess
Algorithm 6
Add an edge connecting kmer1 and kmer2

    function addEdge(kmer1, kmer2, pos1, pos2)
        if lastAccess != NULL and lastAccess.kmer == kmer1 then
            ptr1 ← lastAccess
        else
            ptr1 ← LookUp(kmer1, pos1)
        end if
        ptr2 ← LookUp(kmer2, pos2)
        lastAccess ← ptr2
        createLink(ptr1, ptr2)
        return ptr1
    end function

register, the lookup completes at the cost of only reading that register. In the case of a miss to the lastAccess register, the node is looked up in the nodeCache, as shown in the figure. In the case of a hit to the nodeCache, the process ends there, at the cost of accessing cache memory. In the case of a miss to the nodeCache, the hash table has to be accessed, as shown. The cost of accessing the hash table residing in the RAM is high, but misses that end up in the hash table are rare (as mentioned previously). For end nodes, the inspection starts directly from the nodeCache, as shown in Fig. 3.6. In summary, if not for the lastAccess register and the nodeCache, all the accesses would go directly to the hash table, causing frequent accesses to RAM.

All experiments were performed on a server with four Intel Xeon X7560 processors (a total of 32 cores / 64 CPU threads) and 256 GB of RAM. The Platypus variant caller (downloaded
Algorithm 7
First look up the nodeCache, then the hash table

    function LookUp(kmer, pos)
        cacheItem = nodeCache[pos]
        if cacheItem != NULL and cacheItem.kmer == kmer then
            ptr ← cacheItem
        else
            ptr ← hashTableLookUpOrInsert(kmer)
        end if
        return ptr
    end function

from [167]), which implements the algorithm in Section 3.3, is referred to as the baseline implementation. The modified version of Platypus, based on the method in Section 3.4, is referred to as the optimised implementation. Three real datasets from the 1000 Genomes Project (the same whole-genome sequencing data used in [84] for the performance assessment of Platypus, downloaded from [168]) were used for the experiments. The three datasets are aligned 75-86X, 100-bp paired-end Illumina HiSeq 2000 reads (BAM files) for the parent-offspring trio NA12878, NA12891 and NA12892. Platypus was run with the default parameters, with de Bruijn-based assembly turned on. Variants were called on all chromosomes, including X and Y.

Section 3.4 described how accesses to the hash table are minimised by using a register that stores the previous node (referred to as lastAccess) and a cache-friendly array (referred to as the nodeCache). Fig. 3.7 shows how memory accesses to lastAccess, the nodeCache and the hash table are distributed. The X-axis in Fig. 3.7 denotes the dataset and the type of memory access; the Y-axis shows the access percentage for each item on the X-axis. The data used to compute the percentages in Fig. 3.7 are given in Table 3.1. These data were obtained by running the optimised implementation with software counters introduced to count the different memory accesses. The first column of Table 3.1 is the dataset and the second column is the memory access type. The third column contains the number of memory accesses that occurred
Figure 3.6: Summary of the outcome of the proposed method

when locating the start nodes of the edges in the de Bruijn graph. Similarly, the fourth column is for end nodes. The last column is the total number of accesses, which is the sum of columns three and four. Note that the numbers are given in ×10. The number of hits to the lastAccess register is equal to the number of times the program reaches line 3 in Algorithm 6. Only accesses to start nodes are responsible for lastAccess hits; therefore, lastAccess hits due to end nodes are 0, as shown in the table. Similarly, nodeCache accesses and hash table accesses map to lines 4 and 6 respectively in Algorithm 7. The function LookUp in Algorithm 7 called at line 5 of Algorithm 6 corresponds to start nodes. Similarly, LookUp called at line
Figure 3.7: Distribution of memory accesses in the optimised implementation

7 of Algorithm 6 corresponds to end nodes. The data in Table 3.1 include the accesses that occurred when loading both the reference and the reads into the de Bruijn graph. The percentage value for each item in Fig. 3.7 is calculated out of the total memory accesses for that dataset. For instance, the total memory accesses for dataset NA12878 is the sum of the three values 317.36, 301.35 and 25.66 in the last column of Table 3.1. The values 317.36, 4.71, 296.64, 0.11 and 25.55 for dataset NA12878, expressed as percentages of this sum, equate to 49.25%, 0.73%, 46.04%, 0.02% and 3.96% respectively in Fig. 3.7.

In Fig. 3.7, observe that for all datasets about 49% of total accesses are hits to the lastAccess register. Note that all hits to lastAccess are for start nodes. A further ~46.5% of accesses are hits to the nodeCache; the majority are from end nodes, as most start nodes have already been resolved through the lastAccess register. Only about 4.5% of the accesses must go to the hash table.

Table 3.1: Memory access distribution in the optimised implementation

    Dataset   Memory access type   Start nodes (×10)   End nodes (×10)   Total (×10)
    NA12878   Last Access          317.36              000.00            317.36
              Node Cache           004.71              296.64            301.35
              Hash Table           000.11              025.55            025.66
    NA12891   Last Access          288.57              000.00            288.57
              Node Cache           004.59              268.26            272.85
              Hash Table           000.11              025.02            025.13
    NA12892   Last Access          284.67              000.00            284.67
              Node Cache           004.61              264.54            269.15
              Hash Table           000.11              024.86            024.97

The implication is that only 4.5% of the accesses are misses that go to the RAM, and therefore the techniques presented in Section 3.4 have enabled efficient usage of the memory hierarchy. However, 4.5% is higher than the percentage anticipated in Section 3.4, mainly because the values in Fig. 3.7 also include the memory accesses made when loading the reference into the graph.
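The percentages quoted above can be reproduced directly from the NA12878 row of Table 3.1:

```python
# Table 3.1, NA12878 rows: (start-node accesses, end-node accesses), in x10 units
last_access = (317.36, 0.00)
node_cache = (4.71, 296.64)
hash_table = (0.11, 25.55)

total = sum(last_access) + sum(node_cache) + sum(hash_table)   # 644.37
for name, (start, end) in [("lastAccess", last_access),
                           ("nodeCache", node_cache),
                           ("hashTable", hash_table)]:
    print(f"{name}: start {100*start/total:.2f}%  end {100*end/total:.2f}%")
```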
Additionally, alignment artefacts would cause mismatches between k-mers in the reference and the reads, which would also have increased this percentage.

Fig. 3.8 compares the time taken for graph construction by the baseline implementation and the optimised implementation. For each dataset, each implementation was run using 8, 16, 32 and 64 threads, shown as 8t, 16t, 32t and 64t along the X-axis. The Y-axis shows the runtime for graph construction in each case, in seconds. In all cases, the optimised implementation was at least twice as fast as the baseline implementation. (The overall speed-up of Platypus with our optimised implementation integrated was around 1.4-1.6 times.) Since the execution times of Platypus are considerable for the three large datasets used, the execution times given in Fig. 3.8 are an average of three repetitions. All three repetitions consistently produced similar values on a general-purpose server (Supplementary Table S1).

To further validate the claims on the speed-up, we performed two experiments with small datasets, so that each test could be repeated a large number of times, with the intention of testing: 1. the variability of the execution time for the same dataset (randomness due to operating system scheduling); and 2. the variability in the speed-up for different datasets.

In the first experiment, we ran the baseline implementation and the optimised implementation on a single dataset 100 times each (Supplementary Table S2). Only chromosome 1 of the NA12878 dataset was considered, executed with 64 threads for 100 repetitions. The two distributions (baseline implementation and optimised implementation) are near normal (Supplementary Figures S1 and S2; the two outliers are explained in the figures).
We performed an independent-sample t-test on log-transformed data to test the null hypothesis that the data in the two distributions come from populations with equal means. The log-transformed values were preferred, as the exponential of the difference between the two means gives the speed-up. The mean speed-up was 2.047, with a 95% confidence interval of 2.042-2.053. The null hypothesis could be rejected at the 5% significance level with a p-value < 0.0001. Hence, we may conclude that the observed speed-up is not due to random variation.

In the second experiment, we executed the baseline implementation and the optimised implementation on 69 different datasets (Supplementary Table S3). Each chromosome (chr 1-22 and chr X) of the three datasets NA12878, NA12891 and NA12892 was considered a separate dataset (hence the 69 datasets). A paired-sample t-test was performed on the log-transformed data (X: log-transformed times for the baseline implementation; Y: log-transformed times for the optimised implementation) to test the null hypothesis that the mean of X-Y is equal to 0 (i.e. the speed-up is 1). The distribution of X-Y was near normal (Supplementary Figure S3). The mean speed-up was 2.028, with a 95% confidence interval of 2.017-2.039. The null hypothesis could be rejected at the 5% significance level with a p-value < 0.0001. Hence,
the speed-up is evident across different datasets.

Figure 3.8: Execution time for the baseline implementation and the optimised implementation
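The role of the log transform in the analysis above is that a difference of means on the log scale exponentiates to a multiplicative speed-up factor. A minimal sketch with hypothetical timings (not the thesis data; a significance test such as scipy.stats.ttest_ind would be applied to the same log values):

```python
import math
import statistics

# hypothetical repeated timings in seconds (the real measurements are in
# Supplementary Table S2); the optimised runs are roughly half as long
baseline = [100.4, 99.1, 101.2, 100.9, 98.7]
optimised = [49.0, 50.2, 48.8, 49.5, 50.1]

# difference of means on the log scale; its exponential is the speed-up
diff = statistics.mean(math.log(t) for t in baseline) - \
       statistics.mean(math.log(t) for t in optimised)
speedup = math.exp(diff)
print(round(speedup, 3))
```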
According to profiling performed on the baseline implementation, approximately:

(a) 70% of the memory accesses to the RAM (during graph construction) are due to the hash table, and
(b) 30% are other memory accesses not due to the hash table.

Our optimisation technique reduced only the memory accesses in (a). The speed-up for the whole graph construction process was about 2X. Note that this speed-up was obtained purely by modifying the algorithm that runs on a general-purpose processor. On general-purpose processors, the programmer's control of the caches and registers is limited. Hence, the nodeCache is implemented as an array that originally resides in the RAM, and lastAccess is implemented as a global variable. In contrast, it is possible to have an exclusive cache for the nodeCache and an exclusive register for lastAccess when building a custom processor. Therefore, the proposed algorithm opens the door to building custom processors such as Application-Specific Instruction-set Processors (ASIPs), for which the baseline algorithm is not suitable. In such a case, the observed speed-up would be higher. Furthermore, the proposed algorithm can lead to efficient local re-assembly implementations on any other system with a memory hierarchy, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs).

The RAM accesses in (b) are due to other memory accesses that occur during graph construction, such as:

1. reading the reference genome in Algorithm 1 (lines 3 and 4),
2. reading the reads in Algorithm 2 (lines 4 and 5), and
3. writing to the located nodes to add connections between k-mers inside the createLink function (called at line 4 of Algorithm 3).

Of these, 1 and 2 are already cache friendly due to the spatial locality of access; accesses to the RAM are still incurred when refilling cache lines. In contrast, 3 is not cache friendly due to the large size of the data structure that stores a node.
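The relationship between the component gain and the end-to-end gain follows Amdahl's law: with graph construction at roughly 66% of the total variant calling time (Fig. 3.1) and a 2X speed-up on that component, the expected overall speed-up is about 1.49X, consistent with the observed 1.4-1.6X for Platypus as a whole:

```python
def amdahl(fraction, component_speedup):
    """Overall speed-up when only `fraction` of the runtime
    is accelerated by `component_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

print(round(amdahl(0.66, 2.0), 2))   # 1.49
```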
A node contains space for fields such as memory pointers to adjacent nodes, and the total node size is even larger than the cache line size of a general-purpose processor. For instance, the typical cache line size of a CPU is 64 bytes, while the size of a node in Platypus is 65 bytes. Hence, at least one access to the RAM is required to fill a cache line for each node access. A larger cache line size that fits several nodes into one line could be implemented when constructing a custom processor. In addition, strategies such as cache pre-fetching (fetching the next adjacent cache line from the RAM before it is actually required) would help boost performance.
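The cache-line arithmetic above can be made concrete (64-byte lines and a 65-byte node, as stated; the 256-byte custom line is a hypothetical illustration):

```python
import math

CACHE_LINE = 64   # bytes: typical CPU cache line
NODE_SIZE = 65    # bytes: a Platypus graph node

# a node larger than one line spans at least two lines, so touching a
# cold node costs up to two RAM line fills
lines_per_node = math.ceil(NODE_SIZE / CACHE_LINE)

# a hypothetical custom processor with 256-byte lines could pack
# several whole nodes into a single line fill
nodes_per_custom_line = 256 // NODE_SIZE

print(lines_per_node, nodes_per_custom_line)   # 2 3
```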
The de Bruijn graph construction during the local re-assembly step of modern variant callers consumes more than 60% of the total variant calling time. We have shown how the existing algorithm can be modified such that the locality of memory accesses is improved, which in turn improves the usage of the faster cache memories. The results show that these changes improve the performance of de Bruijn graph construction by a factor of around two when implemented on a general-purpose processor. The modified algorithm opens the door to much greater acceleration of local re-assembly on GPUs, FPGAs and ASIPs. The implementation of the algorithm, integrated into the Platypus variant caller, is publicly available at [159].

Chapter 4
Featherweight Long Read Alignment using Partitioned Reference Indexes
This chapter is published in Nature Scientific Reports under a Creative Commons CC BY license as
H. Gamaarachchi, S. Parameswaran, and M. A. Smith, "Featherweight long read alignment using partitioned reference indexes," Scientific Reports 9, 4318 (2019). DOI: https://doi.org/10.1038/s41598-019-40739-8 [25]

The advent of nanopore sequencing has realised portable genomic research and applications. However, state-of-the-art long read aligners and large reference genomes are not compatible with most mobile computing devices due to their high memory requirements. We show how memory requirements can be reduced through parameter optimisation and reference genome partitioning, but highlight the associated limitations and caveats of these approaches. We then demonstrate how these issues can be overcome through an appropriate merging technique. We incorporated multi-index merging into the Minimap2 aligner and demonstrate that long read alignment to the human genome can be performed on a system with 2GB of RAM with negligible impact on accuracy.
Long read sequencing has revolutionised genome research by facilitating the characterisation of large structural variations, repetitive regions, and de-novo assembly of whole genomes. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are leading manufacturers of long read sequencers. In particular, ONT manufacture sequencers smaller than a mobile phone that can nevertheless output more than 1TB of data in 48 hours. Such highly portable sequencers have realised the possibility of performing genome sequencing in the field. For instance, ONT's MinION sequencer has been used for Ebola virus surveillance in Guinea [16], mobile Zika virus surveillance in Brazil [17], and for experiments on the International Space Station [19].

The advent of highly portable DNA sequencers raises the need for local data processing on devices such as mobile phones, tablets and laptops. Facilitating genomic data analysis on mobile devices avoids the need for high-speed internet connections and enables real-time genomic tests and experiments. For Nanopore sequencers, a pico-ampere ionic current signal is produced for each DNA read, which is subsequently converted to nucleotide bases via machine learning models. Until recently, a high-performance workstation (quad-core i7 or Xeon processor, 16GB RAM, 1TB SSD) was required for live base calling, the process of converting the ionic signal to nucleotide sequences.

Most genomic analyses depend on base calling as an initial step, which can be efficiently performed through GPGPU software implementations on graphics cards or, quite conveniently, on dedicated portable hardware (ONT manufacture one such device, termed ‘MinIT’). Next, base-called reads are typically aligned/mapped to a reference, in the case of reference-guided assembly, or aligned to themselves in the case of de-novo assembly.
Subsequent analyses (i.e. consensus sequence generation, variant calling, methylation detection, etc.) follow this alignment step. Therefore, an alignment tool that can run on portable devices such as mobile phones, tablets and laptops is the next step in realising the full portability of the whole Nanopore processing pipeline.

Minimap2 [101] is a general-purpose mapper/aligner that is compatible with both DNA and RNA sequences. Minimap2 can align both long reads and short reads, either to a reference or to an assembly contig. Minimap2 first employs hashing followed by chaining for coarse-grain alignment. Then it performs an optional base-level alignment using an optimised implementation of the Suzuki-Kasahara DP formulation [102]. Minimap2 stands out as the current aligner of choice for long reads, among other long read aligners such as BLASR [96], GraphMap [97], Kart [98], NGMLR [99] and LAMSA [100]; not only is it 30 times faster than existing long read aligners, but its accuracy is on par with or superior to other algorithms [101]. The hash table-based approach in Minimap2 has been shown to be effective for long reads. In contrast, FM-index [67] based short read aligners such as BWA [70] and Bowtie [68] have been shown to fail with ultra-long reads (i.e. several hundred kilobases or more) [56].

Most alignment tools build an index of reference sequences that is stored in volatile memory. Whilst this is manageable for small genomes such as those of individual bacteria, it becomes problematic for large genomes such as the human genome. With default options, Minimap2 requires more than 11GB of memory to create an index from the human reference genome sequence and align Nanopore reads against it (Table 4.1). Although the pre-calculated index can be saved to disk, 7.7GB are nonetheless required to subsequently load the index into memory, and between 8.8 and 11.3GB are required when intermediate data structures during alignment are included. This exceeds the average RAM capacities of high-end mobile phones and mid-range laptops.
Hence, running Minimap2 on human data with default options on a typical laptop with 8GB of memory or a typical mobile phone with 2GB of memory is not feasible.

Table 4.1: Memory usage of Minimap2 for default parameters (PacBio and Oxford Nanopore)
Index construction
Index residence
Mapping with base-level alignment (SAM output)
Mapping without base-level alignment (PAF output)

Minimap2 was run with default parameters. Pre-set profiles map-pb and map-ont were used for PacBio and Oxford Nanopore, respectively. The peak memory usage for each event is presented. Index construction refers to building the index and then serialising it to a file. Index residence is the memory required only for the index to reside in memory, such as when loading a pre-built index.
We therefore tested the relative effect of alignment parameters on peak memory usage in Minimap2 (see Materials and methods) to investigate whether parameter optimisation alone can significantly reduce the memory requirements without compromising alignment quality. For this purpose, we used Sequins—synthetic DNA spike-in controls that are designed from the reverse or ‘mirrored’ human genome sequence [171]. This chirality reproduces diverse properties of the human genome, such as nucleotide frequencies, complexity, repetitiveness, somatic variation, etc. As detailed in Materials and methods, we aligned Nanopore sequencing data from Sequins to both native and reversed (not complemented) human reference genomes to compare the relative impact of Minimap2 parameters on alignment accuracy. Specifically:

• k : the minimiser k-mer length (default = 15 for ONT data);
• w : the minimiser window size (default = 10);
• t : the number of threads (default = 4);
• K : the number of query bases loaded into memory at a time (default = 500M).

Parameters k and w considerably affect the peak memory usage for holding the index in memory (Figure 4.1a). For an index without homo-polymer compression, k = 15 consumed the least amount of memory out of the values tested, as expected. In fact, the default k-mer size for ONT data in Minimap2 (pre-set command line parameter map-ont) is 15.

Unsurprisingly, parameter w has the most prominent impact on memory usage, which decreases considerably when increasing w (Figure 4.1b). At w = 50, memory usage is capped at 3GB, but the sensitivity (see Materials and methods) is substantially reduced compared to the default value of parameter w (missing mappings in Figure 4.1c).
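To make the roles of k and w concrete, the following is a minimal sketch of minimiser selection. It is deliberately simplified: real Minimap2 hashes k-mers and considers both strands, so `minimisers` here is illustrative only.

```python
import random

def minimisers(seq: str, k: int = 15, w: int = 10) -> set:
    """Pick the (position, k-mer) of the lexicographically smallest k-mer
    in every window of w consecutive k-mers. Consecutive windows usually
    share their minimum, so only ~2/(w+1) of positions are retained."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for i in range(len(kmers) - w + 1):
        # index of the smallest k-mer in window [i, i+w)
        j = min(range(i, i + w), key=lambda x: kmers[x])
        picked.add((j, kmers[j]))
    return picked

random.seed(42)
seq = "".join(random.choice("ACGT") for _ in range(20_000))
# A larger window keeps fewer minimisers, hence a smaller index:
assert len(minimisers(seq, w=50)) < len(minimisers(seq, w=25)) < len(minimisers(seq, w=10))
```

This is the intuition behind Figure 4.1b: the index stores one entry per retained minimiser, so memory falls roughly in proportion to 1/(w+1) as w grows.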
A larger w of 50 reduces sensitivity compared to the default value of w by 20%, whereas a w of 25 entails an apparent reduction in sensitivity of about 7% while nonetheless requiring more than 4GB of memory. Although sufficient for a computer with 8GB of RAM, this is still too high for smaller devices. Importantly, the amount of mismatched mappings in mapped reads is not significantly affected by the w parameter (mismatches in Figure 4.1c), nor are the high-quality alignments, as demonstrated by their dynamic programming (DP) alignment score distribution (cf. DP score > 2000 in Figure 4.1d).
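A back-of-envelope model explains why memory falls with w: random minimiser sampling keeps roughly 2/(w+1) of positions. The per-entry byte cost below is an assumed illustrative figure, not a Minimap2 constant, so the model shows the trend rather than exact values.

```python
def index_mem_gb(genome_bp: float, w: int, bytes_per_entry: int = 16) -> float:
    """Rough index size: ~2/(w+1) minimisers per base, each costing an
    assumed bytes_per_entry for its hash-table and position records."""
    return 2.0 / (w + 1) * genome_bp * bytes_per_entry / 1e9

human_bp = 3.1e9  # approximate human genome length in bases
# The model reproduces the observed trend: memory shrinks as w grows,
# and w=10 lands in the same ballpark as the ~7.7GB index reported above.
assert index_mem_gb(human_bp, 50) < index_mem_gb(human_bp, 25) < index_mem_gb(human_bp, 10)
assert 5 < index_mem_gb(human_bp, 10) < 12
```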
Figure 4.1:
Effect of parameters on memory usage, performance and accuracy. (a) Peak memory usage of the index for different combinations of k and w. (b) Peak memory usage of the index for a large range of w with k = 15. (c) The effect of w on sensitivity and error relative to the default window size. The x-axis is the minimiser window size (w). The k-mer size is held constant at 15 for all values of w. The y-axis shows the number of missing mappings / mismatches or extra mappings (compared to the mappings from default w = 10) as a percentage of the number of reads. (d) Distribution of the dynamic programming alignment score for different minimiser window sizes (w). The x-axis is the score and the y-axis is the smoothed number of mappings for a particular score. Note that the distribution is smoothed to show the trend. (e) Effect of the number of threads on memory and performance. The parameters k, w and K were held constant at 15, 25 and 200M respectively while changing the number of threads. Both the peak memory usage and the runtime were measured on a PC with an Intel i7-6700 CPU and 16GB of RAM. (f) Effect of the number of query bases loaded at a time. The parameters k, w and t were held constant at 15, 25 and 8 respectively.

Figure 4.2:
Effect of the window size parameter w on the distribution of mapping qualities (MAPQ) for synthetic spike-in controls.

Minimap2 allows the reference index to be split by a user-specified number of bases through the option I, effectively dividing a reference into smaller indexes of comparable size. This facilitates parallel computation and, in theory, enables lower peak memory requirements. However, this feature is not ideal for mapping single reads to large references, mainly because global contiguous information about the reference is unavailable. As a result, several mapping artefacts can occur, as listed below and in Figure 4.3 (N.B. these may not be as prominent when overlapping reads, the application for which index partitioning in Minimap2 was originally developed).

1. The mapping quality is incorrect. The mapping quality estimated in Minimap2 is accurate as it deliberately lowers the mapping quality for repetitive hits. However, this is not possible when only a fraction of a whole genome is present in the index (see supplementary materials of Li, H. [101]). In a partitioned index, if the same repeat lies across different partitions, the mapping quality will be overestimated (Figure 4.3b).

2. Incorrect alignment flags. For a chimeric read where different sub-sequences map to different chromosomes, the supplementary mappings would be marked as primary mappings (Figure 4.3a). A repeat-containing read that maps to multiple locations across different partitions will have multiple primary mappings instead of a single primary mapping with secondary mappings (Figure 4.3c).
Figure 4.3:
Effect of aligning sequences to single vs partitioned indexes.
Uniquely mapping chimeric reads (a) can be reconstructed from a partitioned reference index with relative ease. However, sequences (or sub-sequences) that are difficult to map (i.e. low-complexity regions, repetitive elements, etc.) can cause artefacts when aligning to a partitioned reference index. (b) An example where one partition (chr2) contains less homologous sequence to the query sequence, producing the situation where the best alignment when using a single reference is not achieved. (c) An example where a partitioned reference introduces several additional low-quality mappings that would be dismissed with a single reference index. Q: mapping quality score.

The Mapeval utility in Paftools (a tool bundled with Minimap2 for evaluating alignment accuracy) is not compatible with such outputs. Sorting by the read identifier would fix the issue, but requires significant computation for large files.

5. Incomplete headers in the sequence alignment/map (SAM) output. For a partitioned index, Minimap2 suppresses the reference sequence dictionary (SQ lines) in the SAM header. Users must manually add SQ lines to the header for compatibility with downstream analysis tools.

We resolved these issues by serialising and storing the internal state of Minimap2 while mapping reads, then merging the output and processing the result a posteriori (see Materials and methods). The accuracy of this technique is discussed below.
We compared the alignment accuracy between a single reference index and a partitioned index, with and without merging the output. The following acronyms will be used in the subsequent text (see Materials and methods for more details):

• single-idx : aligning reads to a single reference index;
• part-idx-no-merge : aligning reads to a partitioned index without merging the output;
• part-idx-merged : aligning reads to a partitioned index while applying our merging technique.
Synthetic long reads were used as ground truth for the evaluation of alignment accuracy (see Materials and methods). The accuracy of part-idx-merged is similar to that of single-idx, despite employing significantly more partitions (Figures 4.4a and 4.4b), as exemplified by the overlap of their respective curves. In contrast, the results of part-idx-no-merge are considerably less accurate, in particular for larger numbers of index partitions. A lower error rate is observed for part-idx-merged when compared to single-idx for the lowest mapping quality values, but this effect is marginal and is associated with low sensitivity.
As no ground truth is available for biological data, we evaluated alignment accuracy by comparing the number of primary/secondary alignments and unmapped reads across single and partitioned indexes.
Figure 4.4:
Effect of using partitioned indexes versus a single reference index on alignment quality.

(a) Base-level and (b) locus- or block-level alignment accuracy for synthetic long reads. The x-axis shows the error rate of alignments in log scale (see Materials and methods). The y-axis shows the fraction of aligned reads out of all input reads. Each point in the plot corresponds to a mapping quality threshold that varies from 0 (top right) to 60 (bottom left). These plots are akin to precision-recall plots with the x-axis inverted. (c) and (d): alignment statistics for Nanopore whole genome sequencing data from NA12878 [56] using a 16-part index. (c) The number of total entries (primary + secondary + unaligned) for single-idx, 16-part-idx-no-merge and 16-part-idx-merged, in log scale. The dotted horizontal line represents the number of reads. (d) Number of primary mappings as a function of Minimap2 mapping quality (log scale).

With single-idx, Minimap2 outputs 12.1GB of base-level alignment data in SAM format, whereas part-idx-no-merge generates a much larger output (180GB). However, part-idx-merged generates 12.4GB of data, proportional to the output produced with single-idx. Hence, part-idx-merged reduces disk usage by about 14-fold compared to part-idx-no-merge. Peak disk usage is also minimised in part-idx-merged as only intermediate alignments are serialised as temporary binary files. The resulting size of temporary files generated with part-idx-merged is 29.2GB, thus achieving a maximal disk usage of 41.6GB, 4 times less than part-idx-no-merge. The increased output produced by part-idx-no-merge is due to redundant unmapped entries and spurious mappings (Figure 4.4c and Table 4.2).

Table 4.2: Statistics for alignment outputs for 689,781 reads from NA12878 (single-idx, 16-part-idx-no-merge, 16-part-idx-merged)
File size (SAM file)
No of SAM entries
No of unaligned entries
No of aligned entries
No of primary alignments
No of secondary alignments

The numbers of entries produced by single-idx and part-idx-merged are comparable to the number of input reads (689,781), while part-idx-no-merge generates abundant, presumably spurious, hits. Furthermore, the distributions of mapping qualities for primary alignments between part-idx-merged and single-idx are quite similar (Figure 4.4d). Interestingly, part-idx-merged produces slightly more primary alignments with lower mapping quality scores than single-idx, a likely consequence of sampling less repetitive regions in partitioned indexes. All the strategies produce almost the same number of mappings with quality = 60. In contrast, part-idx-no-merge has a very high number of spurious mappings for mapping qualities between 0 and 59.

A more detailed comparison of 689,781 ONT reads aligned using single-idx and 16-part-idx-merged revealed the following: 120,623 (17.49%) reads were unmapped in both; 152 (0.02%) reads mapped only in single-idx; 6,554 (0.95%) reads mapped only in 16-part-idx-merged; 562,452 (81.54%) reads mapped in both. Less than 1% of all reads presented discordant mappings when using a single or a multi-part index. Of those discordant mappings, 6,423 (95.8%) overlapped regions in the human genome annotated as repeat elements or low-complexity sequences, the majority of which were satellites and ALR/Alpha repeats from centromeres (Figures 4.5 and 4.6).
Furthermore, most (97.7%) of these index-specific unique alignments stem from the multi-part index, which suggests that a reduced search space can help Minimap2 map less complex sequences, presumably through more frequent recourse to the dynamic programming step in Minimap2.

Among the 562,452 reads that mapped in both single-idx and 16-part-idx-merged, 545,306 (96.95%) had exactly the same primary mappings (same chromosome, strand and position). Out of the remaining 17,146 aligned reads (3.05%): 2,748 (16.02%) of mapping coordinates overlapped by at least 10% in both sets; 952 (5.55%) were classified as supplementary mappings in 16-part-idx-merged and the primary mapping in single-idx; 3,891 (22.69%) were classified as secondary mappings in 16-part-idx-merged and primary mappings in single-idx. Of the 17,146 reads with disparate mappings, 50.5% had higher DP alignment scores for the single index, while 42.9% had higher scores in the 16-part index (Pearson's correlation = 0.93, Figures 4.7 and 4.8). This effect was similarly observed for the MAPQ scores, with 15.2% and 12.8%, respectively, suggesting that alignments are of marginally better quality when generated with a single index. Again, these disparate mappings are largely composed of repetitive and viral sequences (Figure 4.9). The same trends were observed when the 2,748 overlapping reads were removed from the comparison (Figure 4.10).
Figure 4.5:
Features of alignments that were uniquely mapped in single (left column) and multiple (right column) partition indexing strategies. (a) and (b): the distribution of mapping quality scores (MAPQ). (c) and (d): the distribution of dynamic programming alignment scores. (e) and (f): the proportion of reads that map to regions of the human genome annotated as repeat elements (RepeatMasker track of the GRCh38 UCSC genome browser).
Figure 4.6: Genome browser screenshots of alignments unique to the 16-part index that do not overlap annotated repeats.
Out of the 281 alignments found only in 16-part-idx-merged with no overlap with repeat regions, the two highest-scoring alignments were found in the regions (a) chr3:185,453,170-185,456,442 and (b) chr11:14,160,389-14,161,287, respectively. The screenshots are from the Golden Helix GenomeBrowse [http://goldenhelix.com/products/GenomeBrowse]. The first track of each screenshot shows the pile-up of the alignments for the genomic region. The second track shows the repeat elements from the UCSC RepeatMasker track for GRCh38. The third track visualises the relative sequence composition of the GRCh38 reference for the particular region (A - red, C - yellow, G - green and T - blue).

Figure 4.7:
Distribution of (a) mapping quality and (b) dynamic programming alignment score for the disparate mappings (different by at least one base position) between single-idx (left) and 16-part-idx-merged (right). Disparate mappings here refer to the 17,146 aligned NA12878 reads with different primary mappings (different by at least one base position) between the two partition strategies.
To evaluate how chimeric reads are affected by aligning them to partitioned indexes, we tested this case on an ultra-long (473kb) chromothriptic Nanopore read from a patient-derived liposarcoma cell line [173]. Chromothripsis is a genetic phenomenon often associated with cancer and congenital diseases. It is caused by several rounds of breakage-fusion-bridge, which produce complex and localised genomic rearrangements in a relatively short segment of DNA. The single-idx strategy produced 41 mappings (36 primary + 5 secondary) (Figure 4.11a). However, part-idx-no-merge (16 partitions) produced 688 mappings (608 primary + 80 secondary) (Figure 4.11b), while mapping with part-idx-merged resulted in 47 mappings (42 primary + 5 secondary) (Figure 4.11c).

In single-idx and part-idx-merged, 34 mappings were the same. Interestingly, there were 7 mappings unique to single-idx and 6 unique to part-idx-merged (Table 4.3). All 7 alignments unique to single-idx map to the centromeric region of chromosome 11 (Figure 4.11d), which is composed of large arrays of repetitive DNA (also known as satellite DNA).

Figure 4.8:
Scatter plot of alignment scores of single-idx vs 16-part-idx-merged for disparate alignments (different by at least one base position). The scatter plot contains 17,146 points representing each disparate mapping, i.e. different by at least one base position (Pearson's correlation (r) of 0.9285). The x and y axes are in log scale. 50.5% had higher dynamic programming alignment scores for single-idx, while 42.9% had higher scores in 16-part-idx-merged.

The alignments that are unique to part-idx-merged map to simple repeats (e.g. GAGAGAGA). In addition to the comparable quality of alignments, using a partitioned index yields impressive reductions in peak memory usage during indexing. About 7.7GB of memory is required to hold a single reference index, whereas only 1.5GB is needed for a partitioned index with 16 parts (Figure 4.12a). Peak memory usage can be further reduced by distributing or 'balancing' chromosomes across partitions based on their size (see Materials and methods). These indexing approaches combined with a mini-batch size between 5-20 Mbases (Minimap2 parameter K) enable alignment of long reads to the human genome with less than 2GB of RAM. Although generating an index ab initio requires more memory than loading a pre-built one, this only needs to be done once for a given reference and can be performed a priori, if required. The reduced peak memory usage of a partitioned index comes with an inherent sacrifice in computing speed.

Figure 4.9: Distribution of genomic features of disparate alignments (different by at least one base position). (a) The distribution of repeat diversity in the disparate alignments (different by at least one base position) between single-idx and 16-part-idx-merged. (b) Distribution of genomic targets for the 456 (left) and 472 (right) disparate alignments (that did not overlap with annotated repeat regions) between single-idx and 16-part-idx-merged, respectively.

Figure 4.10:
Statistics and genomic features of disparate mappings (mapping positions not overlapping by at least 10%) between single-idx and 16-part-idx-merged.
Out of the 17,146 mappings that were different by at least one base position, 14,398 did not even overlap by 10% or more with their mapped locations on the reference. For those 14,398 mappings, the distributions of the (a) mapping qualities and (b) alignment scores, (c) the scatter plot between the alignment scores, (d) the distribution of repeat diversity and (e) the mapping locations of the mappings that did not contain repeats were almost identical to the respective plots for disparate alignments (different by at least one base position).
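The chromosome 'balancing' mentioned above can be sketched as a greedy longest-first packing. The function below is a hypothetical stand-in for the procedure detailed in Materials and methods (sizes in Mbases, approximate GRCh38 values):

```python
def balance(chrom_sizes: dict, parts: int) -> list:
    """Greedy longest-processing-time packing: assign each chromosome,
    largest first, to the currently lightest partition. Returns a list
    of [total_size, [chromosome names]] bins."""
    bins = [[0, []] for _ in range(parts)]
    for name, size in sorted(chrom_sizes.items(), key=lambda kv: -kv[1]):
        lightest = min(bins, key=lambda b: b[0])
        lightest[0] += size
        lightest[1].append(name)
    return bins

# Approximate GRCh38 chromosome lengths in Mbases:
sizes = {"chr1": 249, "chr2": 243, "chr3": 198, "chr4": 190,
         "chr5": 182, "chr6": 171, "chr7": 159, "chrX": 156}
loads = [b[0] for b in balance(sizes, 4)]
# Partitions end up far closer in size than a naive in-order split
# (chr1+chr2 = 492 Mb vs chr7+chrX = 315 Mb):
assert max(loads) - min(loads) <= 60
```

Since peak memory is set by the largest partition loaded at any time, equalising partition sizes directly lowers the peak, which is what the dark bars in Figure 4.12a illustrate.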
Figure 4.11:
Alignment of an ultra-long Nanopore read from a chromothriptic region.
Mapping coordinates in the entire human reference genome (y-axis) as a function of the position in the read, showing where sub-sequences of the chimeric read map to in the genome for (a) single-idx, (b) part-idx-no-merge, and (c) part-idx-merged. The y-axis begins with chromosome 1 at 0 and ends with chromosomes X, Y, and the mitochondria at the top. The lengths of the rectangles along the x-axis are to the correct scale relative to the length of the read. However, the lengths along the y-axis are exaggerated to a fixed value so that they are clearly visible. In (a) and (c), the areas with dotted circles contain the differences between unique mappings for each alignment strategy. Circled regions in (a) map to a genomic locus harbouring the satellite repeat displayed in (d). Out of the 6 unique mappings in (c), the segment with the highest mapping quality (6) maps to the simple repeat-containing region displayed in (e).

Figure 4.12:
Peak memory usage and runtime for a partitioned index of the human genome. (a) Peak memory usage as a function of the number of partitions for ab initio index generation (left) and loading a pre-built index (right). Dark bars represent memory usage when performing chromosome balancing as described herein, whilst light bars represent the default iterative partition distribution as implemented in Minimap2. (b) Detailed runtime metrics for index building across two computational systems. System 1 is a laptop with flash memory (Intel i7-8750H processor, 16GB of RAM and Toshiba XG5 series NVMe SSD) while system 2 is a workstation with a mechanical hard disk (Intel i7-6700 processor, 16GB of RAM and Toshiba DT01ACA series HDD). The total indexing time has been broken down into three steps: chr balancing, index building and index concatenation. Chr balancing includes the overhead for chromosome sorting, partitioning and writing each partition to a separate file. (c) Runtime for base-level alignment (left) and block/locus-level mapping (right). Systems 1 and 2 are as described in (b). Alignment was performed on the NA12878 data (see Materials and methods) with the map-ont pre-set in Minimap2 using 8 threads. Runtime statistics are composed of indexing (index generation including the overhead), mapping (total time for aligning reads to each partition), and merging using the method described herein. Runtimes include file reading and writing.

Table 4.3: Mappings of the chromothriptic read which are different in single-idx and 16-part-idx-merged
RefName | RefStart | RefEnd | ReadStart | ReadEnd | Strand | MAPQ | Type

Only in single-idx:
chr11 | 51616896 | 51619171 | 228106 | 230227 | + | 0 | Primary
chr11 | 51658200 | 51659147 | 229344 | 230227 | + | 0 | Secondary
chr11 | 51735238 | 51735692 | 229848 | 230263 | + | 0 | Secondary
chr11 | 51913079 | 51916719 | 226802 | 230227 | + | 0 | Secondary
chr11 | 53527640 | 53533417 | 226770 | 232447 | + | 0 | Primary
chr11 | 53696097 | 53697156 | 229848 | 230835 | + | 0 | Secondary
chr11 | 54005962 | 54006133 | 226668 | 226835 | + | 38 | Primary

Only in 16-part-idx-merged:
chr5 | 61332327 | 61332440 | 156722 | 156833 | + | 0 | Primary
chr5 | 61332327 | 61332729 | 156586 | 156863 | + | 6 | Primary
chr6 | 125655477 | 125659733 | 371459 | 375714 | + | 0 | Primary
chr6 | 1610948 | 1611924 | 156014 | 156860 | + | 1 | Primary
chr8 | 60678764 | 60678812 | 156722 | 156770 | + | 0 | Secondary
chr8 | 60678764 | 60678812 | 156806 | 156860 | + | 0 | Secondary
The alignments found only in single-idx mapped to locations in the range chr11:51616896-54006133. The ones unique to 16-part-idx-merged mapped to repetitive regions in chr5, chr6 and chr8: chr5:61332327-61332729, chr6:1610948-1611924 and chr8:60678764-60678812 contained simple repeats, while chr6:125,655,477-125,659,733 had simple repeats, SINE repeats and LTR repeats.

Alignment requires significantly more time than the balancing, indexing and merging steps when generating an index ab initio, which we observed to be relatively constant across different partition ranges (Figure 4.12b). Less than 10% of the total compute time (5.7h) for base-level alignment is dedicated to balancing, indexing, and merging when mapping to 16 partitions with eight CPU threads and a mechanical hard drive. When using flash memory, the overheads have a very minor impact (3% of the total compute time). Chromosome balancing, for instance, required less than 1 minute and merging required less than 2 minutes.

We observed that the alignment time increased less than linearly with the number of partitions in the index (Figure 4.12c). Since our motivation is to reduce the memory requirements of mapping to large references, this will inevitably impact speed. However, this also facilitates parallelisation of alignments, where several small index partitions can be distributed across an array of low-memory devices (e.g. microcomputing boards such as the Raspberry Pi). It also enables the use of mobile computing devices, such as mobile phones or inexpensive laptops, which would otherwise be impossible. Considering that ONT's MinION sequencer generates about 80% of all data in the first 24h, using 16 partitions would enable real-time mapping on a system with 2GB of RAM in parallel to data acquisition, whereas a system with 4GB of RAM would only require 4 partitions and less than 1h of compute to align a typical ONT MinION dataset.
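As a rough planning aid, the relationship between RAM budget and partition count can be sketched as below. The fixed working-memory allowance is an assumed figure (the real overhead depends on K, the thread count and read lengths), so the helper illustrates only the trend, not the exact partition counts reported above.

```python
import math

def parts_needed(index_gb: float, ram_gb: float, working_gb: float = 1.0) -> int:
    """Smallest number of equal index partitions whose per-partition share
    fits in ram_gb after reserving an assumed working_gb for mapping buffers."""
    budget = ram_gb - working_gb
    if budget <= 0:
        raise ValueError("RAM budget too small for the working set")
    return math.ceil(index_gb / budget)

# For the ~7.7GB human index, a smaller RAM budget demands more partitions:
assert parts_needed(7.7, 4.0) < parts_needed(7.7, 2.0)
```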
This work details two ways to reduce memory requirements for performing alignments on large genomes, or collections of genomes, using Minimap2. By tuning alignment parameters, peak memory usage can be lowered marginally, although with a non-negligible impact on the accuracy of alignments. We demonstrated this effect by sequencing and mapping a diverse and representative set of synthetic spike-in controls, which can be used as a ground truth to assess sequencing and alignment accuracy. These data were not used to benchmark the absolute alignment accuracy of Minimap2, but rather to demonstrate the comparative and relative impact of alignment parameters on memory usage. In this regard, we show that partitioning a large reference into smaller indexes, upstream of an appropriate merging process, drastically reduces the peak memory usage without compromising alignment accuracy.

Previous studies have described indexing strategies to improve computational efficiency. DIDA [169] and DREAM-Yara [170] use Bloom filters to distribute sequencing reads to the most appropriate index partitions. However, these works employ methods dependent on indexes generated with the Burrows-Wheeler transform algorithm, which are ideal for the short, less noisy reads generated by second generation sequencing platforms. These strategies focus on indexing enhancements centred on reducing the alignment search space by delivering reads to the most suitable index partition. Our work focuses on multi-index alignment merging, which is independent of and complementary to these strategies. We reveal the problems arising from mapping to multiple partitions, such as reduced mapping accuracy and overestimation of the mapping quality. We show that our merging technique circumvents those issues through the analysis of mapping qualities and the use of independent controls.
The partitioned reference approach has also been used previously to reduce the memory usage of BWA, a popular short read alignment program [151]. However, the final output consists of the indiscriminate concatenation of the alignments from all the partitions, raising several of the caveats exposed in Figure 2. We have demonstrated that appropriate merging of the alignment output is required to eliminate many mapping artefacts, thus improving overall accuracy.

We also showed that part-idx-merged can provide a better result than the simple strategy of filtering out results with low mapping quality in part-idx-no-merge. This is supported by the results from synthetic reads, where the accuracy of alignments with mapping quality 60 in part-idx-no-merge is lower than those from part-idx-merged. Furthermore, a simple strategy to remove all short mappings from part-idx-no-merge is also less than ideal. In fact,
Paftools (which was used for evaluating the synthetic read alignments) considers the longest primary mapping when multiple primary mappings exist to assess alignment accuracy.

However, part-idx-merged can sometimes generate alignments that are not identical to those of single-idx. This is likely a consequence of slight variations in highly abundant k-mers observed when the index is built. Overall, this affects only a few reads, which would nonetheless have low mapping qualities, an issue that has previously been reported by the author and users of Minimap2 (see the public code repository associated with Li, H. [101]). Further, the reported alignments might differ in long low-complexity regions, as Minimap2 may generate suboptimal alignments in such regions (see supplementary materials of Li, H. [101]).

Although a partitioned index reduces peak memory usage, the runtime is proportionately higher. This is because all the reads must be repeatedly mapped to each partition of the reference. However, this strategy lends itself well to distributed computing, in particular when many smaller, less expensive computing devices are available.

A limitation of this method also lies in the maximal number of partitions an index can be split into, which currently depends on the longest chromosome or contig. We have not yet investigated the impact of splitting chromosomes into fragments, although we anticipate this would not drastically affect results (as exemplified by the chromothriptic read example above). Furthermore, we have not tested the impact of this strategy on RNA sequencing read alignment, which uses different alignment scoring metrics.

In addition to the capability of mapping long reads to large genomes on devices with a small memory footprint, our extension to Minimap2 could potentially be useful for the following applications:
• Mapping to huge reference genome databases. Metagenomic databases can be hundreds of gigabytes in size. Hence, holding the index for the whole database in memory would be challenging even for high-specification servers. Especially when multiple species with similar genomes are present, accurate mapping qualities with correct flags, headers and reduced output file sizes are always appreciated. Alternatively, mapping genome assembly contigs, or a select set of long reads, to a large public sequence repository (akin to a BLASTN nucleotide database query) could benefit from our approach. However, the effect of merging output from such large queries has yet to be investigated.
• Mapping with a lower window size for increased sensitivity. Minimap2 runs with a default minimiser window size of 10. Reducing this value improves the mapping sensitivity but increases the memory consumption. For applications where high sensitivity may be preferred, for instance when confronted with low-coverage sequencing data, our method can be beneficial.

While preparing this manuscript, our method was integrated into the source code of the original Minimap2 software repository. In Minimap2 version 2.12, the option --split-prefix can be used to align to a partitioned index. The developer of Minimap2 has expanded our implementation to support paired-end short reads and multi-threading for the merging process. The original version we implemented for conducting the above experiments is available in the associated github repository [174] and can be useful for understanding the underlying algorithm. There, the partitioned index functionality can be invoked with the option --multi-prefix. Instructions to run the tools are detailed in Supplementary Note A.3.
For measuring peak memory usage and runtime, publicly available NA12878 Nanopore reads [56] were aligned to the human genome reference (GRCh38) with Minimap2 [101]. Peak memory usage and runtime were measured using the GNU command line time utility with the -v option.

Sensitivity and error rate calculations for different window sizes (Minimap2 parameter w) were performed using Sequins, synthetic human genome spike-in controls, and synthetic PacBio reads (see below). By reversing (not complementing) the sequences from regions of interest, these spike-in controls reproduce most features of the human genome, including nucleotide frequencies, somatic variation, low-complexity regions and repeats. Given their chiral or 'mirror' design, Sequins do not align to the native reference sequence but will align to a mirror copy of the human reference genome. They can thus be used to benchmark alignment accuracy when spiked into a normal sample, although they were sequenced in isolation for this study. The particular Sequins design we employed was unpublished at the time this manuscript was prepared (Deveson et al., under review), but it is conceptually similar to that reported by Deveson et al. [171]. 1 µg of Sequins DNA was sequenced on an ONT R9.4.1 flow cell using the LSK108 sample preparation kit, and the results were base-called with ONT's proprietary Albacore software (version 1.2.6). Reads were mapped to the reverse human genome using Minimap2 under the pre-set map-ont for different window sizes.

We leveraged the chiral design of Sequins to qualify any mapping to the normal reference genome as a false positive. True positive Sequin alignments should display the exact mapping positions on the mirrored human genome, as intended by their design.
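The chiral 'mirror' design can be illustrated with a short sketch. The helper names below are illustrative (not part of the Sequins tooling): a mirror reference is simply each sequence reversed without complementing, which preserves nucleotide frequencies and repeat structure while ensuring the result no longer aligns to the native genome.

```python
def mirror_sequence(seq: str) -> str:
    """Reverse a nucleotide sequence WITHOUT complementing it.

    This is the chiral 'mirror' orientation described for Sequins:
    nucleotide composition and repeat content are preserved, but the
    sequence no longer matches the native reference orientation.
    """
    return seq[::-1]

def reverse_complement(seq: str) -> str:
    """Ordinary reverse complement, shown for contrast."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]

s = "GATTACA"
assert mirror_sequence(s) == "ACATTAG"     # reversed only
assert reverse_complement(s) == "TGTAATC"  # reversed and complemented
```

Note the difference from a reverse complement: only the latter corresponds to the opposite strand of real DNA, which is why a mirrored sequence is biologically 'impossible' and safe to spike in.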
However, given stochastic variations in sequencing (base calling idiosyncrasies, involuntary library fragmentation, sample degradation, etc.), the primary mappings derived from the default window size parameter (w = 10) in Minimap2 were used as a reference to assess the relative effect and impact of parameter tuning. Then, for a given window size:

• Mismatching mappings refer to primary mappings that had different positions to the mappings with reference parameters;
• Missing mappings refer to primary mappings that were not observed in empirical alignments, but were observed in alignments with reference parameters;
• Extra mappings refer to primary mappings that were observed in empirical alignments, but were not observed in alignments with reference parameters.

The above counts were expressed as a percentage of the total number of reads. The sum of the mismatching and extra mapping percentages was taken as an approximation of the relative error rate. The relative sensitivity was approximated by subtracting the percentage of missing mappings from 100.
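The error-rate and sensitivity approximations above amount to simple percentage arithmetic. A minimal sketch (the function name and example counts are illustrative, not from the thesis tooling):

```python
def relative_metrics(n_reads, n_mismatch, n_missing, n_extra):
    """Approximate relative error rate and sensitivity for one window
    size, expressing the mapping-category counts as percentages of the
    total number of reads, as described in the text."""
    pct = lambda n: 100.0 * n / n_reads
    error_rate = pct(n_mismatch) + pct(n_extra)  # mismatching + extra
    sensitivity = 100.0 - pct(n_missing)         # 100 minus missing
    return error_rate, sensitivity

# hypothetical counts for illustration
err, sens = relative_metrics(n_reads=1_000_000,
                             n_mismatch=1200, n_missing=3000, n_extra=800)
assert abs(err - 0.2) < 1e-9    # (0.12% + 0.08%)
assert abs(sens - 99.7) < 1e-9  # 100% - 0.3%
```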
We extended the partitioned index approach of Minimap2 to eliminate alignment artefacts as described below. The index partitioning in Minimap2 is inherited from the first version of Minimap [115]. This feature was designed for finding long read overlaps to be used with assembly tools such as Miniasm [115]. As overlap computing requires all-vs-all mapping of reads, the index is built for chunks of 4 Gbases at a time (this can be overridden with the -I argument), effectively partitioning the alignment index to keep the maximum memory capped at around 27 GB. For each part of the index, Minimap attempts to map all the reads. The concatenated alignments from all the parts form the final output.

We modified Minimap2 to serialise and store the software's internal state during the alignment process. The internal state is serialised in binary format to reduce disk usage. The internal state includes: (i) mapped positions, chaining scores and other mapping statistics for each alignment record; (ii) DP score, CIGAR string and other base-level alignment statistics for each alignment record (when base-level alignment is specified); and (iii) the summed length of read regions covered by highly repetitive k-mers for each read (referred to as the repeat length). These data form the serialised binary files, one for each partition of the index.

When an alignment process has completed, we simultaneously open all the serialised binary files together with the queried sequence file. For each queried read (or contig), the previously serialised internal states of all alignments for the given read (resulting from all the index partitions) are loaded into memory. If no base-level alignment has been requested, the alignments are sorted based on the chaining score in descending order. Otherwise, the sorting is based on the DP alignment score in descending order. The classification of primary and secondary chains is re-iterated as implemented in Minimap2.
This corrects the primary and secondary flags in the output. Then, the secondary alignment entries are filtered based on a user-requested number of secondary alignments and the requested minimum primary-to-secondary score ratio, effectively removing spurious secondary alignments. If SAM output has been requested, the best primary alignment is retained as the primary alignment and all other primary alignments are classified as supplementary alignments. An unaligned record is printed only if the read is not mapped to any part of the index.

The length of the read covered by repeat regions in the whole genome (the repeat length) is one of the parameters required to estimate an ideal mapping quality (MAPQ). The MAPQ produced by Minimap2 is a globally computed heuristic that depends on a large number of parameters, including this repeat length. We estimate this global repeat length by taking the maximum of the previously serialised repeat lengths (for each partition of the index) for that particular read. The Spearman correlation between this estimated repeat length and the global repeat length is 0.9961. In theory, it would be possible to calculate this value exactly by serialising and storing the positions of repeats within the read. However, as the MAPQ is itself an estimate and the accuracy of mappings was adequate in our initial tests, we simply took the maximum. Hence, the MAPQ computed when merging a partitioned index is not exactly the same as for a single reference index, but is very similar overall. This computed MAPQ is more accurate than a MAPQ computed only from the repeat length for a single part of the index.

Merging is performed in the order of input read sequences, and mappings for a particular read ID will be adjacent in the output.
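The per-read merging scheme described above can be sketched in simplified form. The following is an illustrative Python sketch with hypothetical record fields, not the actual C implementation in Minimap2: alignments from all partitions are pooled, sorted by score, re-classified into one primary plus filtered secondaries, and the global repeat length is approximated as the maximum over partitions.

```python
from dataclasses import dataclass

@dataclass
class Aln:              # hypothetical, simplified alignment record
    ref: str
    pos: int
    score: int          # chaining score (or DP score if base-level)
    repeat_len: int     # repeat length serialised for this partition

def merge_read(partition_alns, max_secondary=5, min_ratio=0.8):
    """Merge the alignments of ONE read coming from every index
    partition: sort by score, re-classify primary/secondary, drop
    secondaries below the primary-to-secondary score ratio, and
    estimate the global repeat length as the per-partition maximum."""
    alns = [a for part in partition_alns for a in part]
    if not alns:
        return None, [], 0          # unaligned read
    alns.sort(key=lambda a: a.score, reverse=True)
    primary, rest = alns[0], alns[1:]
    secondaries = [a for a in rest
                   if a.score >= min_ratio * primary.score][:max_secondary]
    repeat_len = max(a.repeat_len for a in alns)   # proxy for global value
    return primary, secondaries, repeat_len
```

The real implementation additionally handles supplementary flags for SAM output and streams the serialised state from disk per read, keeping merging memory to a few megabytes.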
As the serialised data are loaded into memory for each read (or a batch of a few reads) at a time, the memory usage of merging is only a few megabytes. For a detailed explanation of the merging algorithm, refer to Supplementary Note A.1.

The construction of partitioned indexes in Minimap2 (specified by the -I option) processes reference sequences iteratively, which does not distribute reference sequences (i.e. chromosomes) evenly when using multiple partitions. We implemented a simple sorting and binning algorithm to mitigate this effect. First, a command line parameter describing the number of desired partitions is considered. Then, the reference sequences (or chromosomes) are sorted in descending order of size. Next, each chromosome is assigned to the bin (or partition) with the lowest sum of bases, and the sum of that bin is then incremented by the chromosome size. This effectively distributes the chromosomes to roughly balanced partitions in O(n log n) time complexity, adding negligible overhead to the overall indexing process (Table 4.4). We output the reference sequences belonging to each bucket in a separate file. Finally, we launch the Minimap2 indexer on each file and concatenate the indexes. Refer to Supplementary Note A.2 for a detailed explanation. This approach is available under misc/idxtools in the github repository [174] and the instructions to run the tool are detailed in Supplementary Note A.3.
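The balancing step above is the classic greedy longest-first binning heuristic. A minimal sketch (not the actual misc/idxtools implementation) using a min-heap over the running partition totals:

```python
import heapq

def balance_partitions(chrom_lengths, n_parts):
    """Greedy longest-first binning as described in the text: sort
    chromosomes by descending size, then repeatedly assign the next
    chromosome to the partition with the lowest current sum of bases.
    Sorting dominates, giving O(n log n) overall."""
    heap = [(0, i, []) for i in range(n_parts)]  # (total_bases, part_id, names)
    heapq.heapify(heap)
    for name, length in sorted(chrom_lengths.items(),
                               key=lambda kv: kv[1], reverse=True):
        total, pid, names = heapq.heappop(heap)  # smallest partition so far
        names.append(name)
        heapq.heappush(heap, (total + length, pid, names))
    return {pid: (total, names) for total, pid, names in heap}
```

For example, four chromosomes of 10, 8, 6 and 5 Mbases split into two partitions come out as 15 and 14 Mbases, rather than the unbalanced split an iterative fill would produce.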
Table 4.4: Detailed runtime for partitioned indexes

                                          system 1                              system 2
number of partitions          1      2      4      8      16       1      2      4      8      16

Indexing
chr balancing (min)          0.00   0.15   0.12   0.12   0.13     0.00   0.71   0.50   0.49   0.47
index building (min)         1.02   1.06   1.11   1.21   1.38     1.95   1.87   1.79   1.53   1.58
index concatenation (min)    0.00   0.16   0.12   0.13   0.14     0.00   1.20   1.22   1.20   1.25
total indexing (min)         1.02   1.37   1.35   1.47   1.64     1.95   3.77   3.51   3.22   3.29

Mapping with base-level alignment
index loading (min)          0.23   0.19   0.24   0.27   0.28     1.51   1.48   1.58   1.44   1.47
mapping (min)               17.86  36.03  72.57 133.38 238.33    32.75  43.72  89.01 165.29 299.14
merging (min)                0.00   0.88   0.90   0.93   1.07     0.00   7.27  11.69  21.39  33.43
total mapping (min)         18.10  37.10  73.71 134.58 239.68    34.27  52.46 102.28 188.12 334.05

Mapping without base-level alignment
index loading (min)          0.18   0.19   0.24   0.26   0.26     1.17   1.28   1.41   1.40   1.45
mapping (min)                6.54   7.84  11.82  16.57  24.93     7.16   9.03  13.29  18.68  29.63
merging (min)                0.00   0.27   0.27   0.25   0.27     0.00   0.63   0.23   0.34   0.61
total mapping (min)          6.73   8.31  12.33  17.08  25.46     8.33  10.95  14.92  20.42  31.69
System 1 is a laptop with flash memory (Intel i7-8750H processor, 16 GB of RAM and a Toshiba XG5 series NVMe SSD), while system 2 is a workstation with a mechanical hard disk (Intel i7-6700 processor, 16 GB of RAM and a Toshiba DT01ACA series HDD). Alignment was performed on the NA12878 Nanopore data with the map-ont pre-set in Minimap2 using 8 threads.
All experiments were performed using the human genome as a reference (GRCh38 with no ALT contigs). The scripts and tools written for performing the experiments are available under misc/idxtools/eval in the github repository [174].
Mapping accuracy was evaluated using synthetic long reads. We generated about 4 million PacBio reads using PbSim [175] under "Continuous Long Read" mode (long reads with a high error rate). The minimum, maximum and mean read lengths were set to 100 bases, 25 kbases and 3 kbases respectively, with a standard deviation of 2300. The minimum, maximum and mean base accuracies were set to 0.75, 1.00 and 0.78 respectively, with a standard deviation of 0.02. The substitution:insertion:deletion ratio was 10:60:30.

In the context of parameter tuning (Figure 4.13), the reads were mapped using Minimap2 with different window sizes while keeping other parameters constant. Then the accuracy evaluation was performed using the
Mapeval utility in Paftools (part of the Minimap2 software package), where a read is considered correctly mapped if the mapping coordinates of its longest alignment overlap the true reference coordinates with an overlap length of 10% or higher.

Figure 4.13: Effect of the window size parameter w on the error rate and sensitivity for simulated reads. The figure plots the fraction of mapped reads against the associated error rate (log scale) for 4 million simulated long reads (see Materials and Methods), with one curve per window size (w = 10 to w = 50 in steps of 5). Each curve contains one point per mapping quality threshold (MAPQ score), from 60 (leftmost) to 0 (rightmost).
In the context of multi-part index accuracy, simulated long reads were aligned using Minimap2 with a single reference index (single-idx), a partitioned index without merging (part-idx-no-merge) and a partitioned index with merging (part-idx-merged). Partitioned indexes with 2, 4, 8 and 16 parts were tested. For each instance, we evaluated base-level alignments (default SAM output) as well as locus- or block-based alignment (default PAF output without CIGAR information). To evaluate alignment accuracy, the
Mapeval utility in Paftools was used with default options, which consider only the longest primary alignment for a read. However, Paftools assumes that all alignments for a particular read reside contiguously. Hence, for part-idx-no-merge, we first sorted the alignments based on the read ID. The output from Paftools contains the cumulative mapping error rate and the cumulative number of mapped reads for different mapping quality thresholds [176]. The fraction of mapped reads is taken as a measure of sensitivity.
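A mapeval-style curve of cumulative sensitivity versus cumulative error rate can be derived as follows. This is an illustrative re-derivation of the idea, not Paftools' actual code: each read contributes its longest primary mapping as a (MAPQ, correct?) pair, and each threshold keeps only the mappings at or above it.

```python
def mapeval_curve(alignments):
    """For each MAPQ threshold from 60 down to 0, compute the
    cumulative fraction of mapped reads and the cumulative error rate
    among alignments at or above that threshold. `alignments` is a
    list of (mapq, is_correct) pairs, one per read."""
    total = len(alignments)
    curve = []
    for q in range(60, -1, -1):
        kept = [(m, ok) for m, ok in alignments if m >= q]
        if not kept:
            continue
        frac_mapped = len(kept) / total                       # sensitivity
        err = sum(1 for _, ok in kept if not ok) / len(kept)  # error rate
        curve.append((q, frac_mapped, err))
    return curve
```

Lowering the threshold admits more reads (higher sensitivity) at the cost of admitting more wrong mappings (higher error rate), which is exactly the trade-off traced by each curve in Figure 4.13.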
We could not find a suitable Nanopore read simulator. Published Nanopore simulators explored at the time of writing had issues such as dependence on Minimap2 (which would cause a bias), unavailability of trained models for the human genome, instability, or unavailability of source code. (For instance, DeepSimulator [177] and NanoSim [178] depend on Minimap2, and the SNaReSim [179] code was not available.) Hence we used a dataset from the publicly available NA12878 sample (rel3-nanopore-wgs-84868110-FAF01132 [56]). The dataset had 689,781 reads with about 5.5 Gbases. We aligned this dataset to the human genome using a 16-part index with merging (part-idx-merged) and without merging (part-idx-no-merge) with base-level alignment. Then we compared those outputs by generating alignment summary metrics against the result from a single reference index (single-idx). We initially attempted to perform an extensive comparison using tools such as the CompareSAMs utility in Picard [58] and the qProfiler utility in AdamaJava [180]. They crashed, probably because they are designed to work with short reads. Hence, we first obtained simple summary metrics using samtools [57] together with custom Linux shell scripts. Then we performed an extensive read-by-read comparison between the SAM outputs from single-idx and part-idx-merged using a custom tool written in C. The tool sequentially reads two SAM files while loading all the alignments for a particular read into memory at a time. For a particular read, it compares and then outputs the alignment entries when the mapping positions between the two sets are disparate, or when the read is mapped in only one set (discordant). On these disparate and discordant mappings, we used bedtools [181] to find overlaps with the UCSC RepeatMasker track.

The same NA12878 dataset was used to measure the runtime of partitioned indexes. The runtime and the peak memory usage were measured using the GNU time command line utility.

The ultra-long chromothriptic read was sourced from an unpublished patient-derived dataset generated in house (see Garsed et al. [173] for more details on the cancer cell line). The data were generated on a MinION MkI sequencer (MN16218) with MinKNOW version 1.1.17 on a first generation R9 flowcell (MIN105, no spot-on loading, flow cell ID FAD24075) using the SQK-RAD001 library preparation kit from ONT. The raw data for the read were live base-called with MinKNOW 1.1.17 and produced an average fastq score of 7.8.
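The read-by-read SAM comparison described above can be sketched in Python. This is a simplified illustration, not the custom C tool: it loads whole inputs into memory rather than streaming, and only inspects the QNAME, RNAME and POS fields.

```python
from itertools import groupby

def read_groups(sam_lines):
    """Group the alignment lines of a read-ID-sorted SAM stream by
    QNAME, skipping '@' header lines; yields (qname, [(rname, pos)])."""
    body = (l.rstrip("\n").split("\t")
            for l in sam_lines if not l.startswith("@"))
    for qname, recs in groupby(body, key=lambda f: f[0]):
        yield qname, [(f[2], int(f[3])) for f in recs]

def compare_sams(sam_a, sam_b):
    """Read-by-read comparison of two SAM outputs: reads whose mapping
    positions differ between the two sets ('disparate') and reads
    mapped in only one set ('discordant'). Both inputs must be sorted
    by read ID."""
    a = dict(read_groups(sam_a))
    b = dict(read_groups(sam_b))
    discordant = set(a) ^ set(b)
    disparate = {q for q in set(a) & set(b) if set(a[q]) != set(b[q])}
    return disparate, discordant
```

The disparate and discordant read IDs can then be fed to bedtools for repeat-overlap analysis, as in the text.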
Aligning long reads generated from third generation high-throughput sequencers to large reference genomes is possible on computers with limited volatile memory. Parameter optimisation alone cannot substantially reduce memory usage without considerably sacrificing alignment quality. Partitioning an alignment index, saving the internal state, and merging the output a posteriori substantially reduces memory usage. This strategy reduces the memory requirements for aligning Nanopore reads to the human reference genome from 11 GB to less than 2 GB, with minimal impact on accuracy.

4.6. DATA AVAILABILITY
The datasets generated and analysed during the evaluation are available in the figshare repository [182] [https://doi.org/10.6084/m9.figshare.6964805.v1].
Chapter 5
GPU Accelerated Adaptive Banded Event Alignment for Rapid Comparative Nanopore Signal Analysis
This chapter is available as a pre-print in bioRxiv under the CC-BY-NC 4.0 International license at [26]. An adapted version of this chapter is published in BMC Bioinformatics:
H. Gamaarachchi, C. W. Lam, G. Jayatilaka, H. Samarakoon, J. T. Simpson, M. A. Smith, and S. Parameswaran, "GPU Accelerated Adaptive Banded Event Alignment for Rapid Comparative Nanopore Signal Analysis," BMC Bioinformatics 21, 343 (2020). DOI: https://doi.org/10.1186/s12859-020-03697-x.

Nanopore sequencing has the potential to revolutionise genomics by realising portable, real-time sequencing applications, including point-of-care diagnostics and in-the-field genotyping. Achieving these applications requires efficient bioinformatic algorithms for the analysis of raw nanopore signal data. For instance, comparing raw nanopore signals to a biological reference sequence is a computationally complex task despite leveraging a dynamic programming algorithm for Adaptive Banded Event Alignment (ABEA), a commonly used approach to polish sequencing data and identify non-standard nucleotides, such as measuring DNA methylation. Here, we parallelise and optimise an implementation of the ABEA algorithm (termed f5c) to efficiently run on heterogeneous CPU-GPU architectures. By optimising memory, compute and load balancing between CPU and GPU, we demonstrate how f5c can perform ~3-5× faster than the original implementation of ABEA in the Nanopolish software package. We also show that f5c enables DNA methylation detection on-the-fly using an embedded System on Chip (SoC) equipped with GPUs. Our work not only demonstrates that complex genomics analyses can be performed on lightweight computing systems, but also benefits High-Performance Computing (HPC). The associated source code for f5c along with the GPU optimised ABEA is available at https://github.com/hasindu2008/f5c.

Advances in genomic technologies have given rise to a new era in biomedical sciences, improving the feasibility and accessibility of rapid species identification, accurate clinical diagnostics, and specialised therapeutics, amongst other applications.
Whole genome sequencing involves 'reading' the entire DNA sequence of a cell, revealing the genetic variation that underlies biological diversity and the onset of disease. A human genome encompasses two copies of a ~3.2 Gbase genome sequence.

Figure 5.1: Nanopore portable sequencer and associated data analysis. The figure depicts a MinION nanopore sequencer with its consumable flowcell (an electrostatic membrane hosting biological nanopores) and a compute module (MinIT); the raw current signal (in pA, sampled at 4000 Hz) produced as a DNA strand passes through an open pore; and the downstream analysis steps: base calling (conversion of the raw signal to a nucleotide sequence), sequence alignment (comparison of a sequencing read to a reference genome or sequence), and polishing (error correction and detection of modified nucleotides, e.g. DNA methylation, using sequence alignment and raw signal data).

The latest generation (third generation) of sequencing technologies can generate ultra-long DNA 'reads' from single molecules in real-time. In particular, Oxford Nanopore Technologies (ONT) manufacture a pocket-sized sequencer called MinION (Fig. 5.1), a relatively inexpensive and portable sequencing device capable of sequencing in-the-field (e.g. a remote area with no network connectivity) or at the point-of-care (e.g. hospital, clinic, pharmacy).

In contrast to 'second' generation sequencers, which produce highly accurate short reads through enzymatic synthesis of the complementary strand of DNA, nanopore sequencing measures characteristic disruptions in the electric current (referred to hereafter as the raw signal) when DNA passes through a biological nanopore (Fig. 5.1). A consumable flowcell containing an array of hundreds or thousands of such nanopores is loaded into the sequencing device (e.g. MinION), which is coupled to a generic (e.g. laptop) or dedicated (e.g. MinIT) compute module to acquire sequence data and base-call (the process of converting the raw signal to nucleotide characters) in parallel.

Nanopore sequencing offers several benefits over other technologies, including ultra-long reads (>1 Mbases), detection of non-standard DNA bases or biochemically modified DNA bases, and real-time analysis, at the expense of a higher error rate, which is predominantly caused by the conversion of the raw signal into DNA bases via probabilistic models (referred to as 'base-calling'). To overcome base-calling errors, the raw signal can be revisited a posteriori (see the polishing step in Fig. 5.1). Such a posteriori 'polishing' can correct for base-calling errors by aligning the raw signal to a biological reference sequence, thus identifying idiosyncrasies in the raw signal by comparing observed signal levels to expected levels at all aligned positions. This process can also reveal base substitutions (i.e. mutations) or base modifications such as 5-methylcytosine (5mC), a dynamic biochemical modification of DNA that is associated with genetic activity and regulation. Detecting 5mC bases is important for the study of DNA methylation in the field of epigenetics.

A crucial algorithmic component of polishing is the alignment of the raw signal (a time series of electric current) to a biological reference sequence. One of the first raw nanopore signal alignment algorithms is implemented in the popular tool
Nanopolish [104], which employs a dynamic programming strategy referred to as Adaptive Banded Event Alignment (ABEA). ABEA is one of the most time consuming steps when analysing raw nanopore data. For instance, when performing methylation detection with Nanopolish, the ABEA step consumes ~70% of the total CPU time. Consequently, it is important to investigate strategies to reduce the runtime of ABEA to improve the turnaround time of certain nanopore sequencing applications, such as real-time polishing or methylation detection.

In this study, we dissect the ABEA algorithm to optimise and parallelise its use on diverse hardware platforms, including Graphics Processing Units (GPUs). Adapting the ABEA algorithm for the GPU is not a straightforward task due to three main factors: (i) read lengths vary significantly (from ~100 to >1M bases), thus requiring millions to billions of dynamic memory allocations, an expensive operation on GPUs; (ii) inefficient memory access patterns, which are not ideal for GPUs with their relatively less powerful and smaller caches (compared to CPUs), result in frequent instruction stalls; and (iii) varying read lengths cause irregular utilisation of the GPU cores.

Recently, deep learning based tools such as Deepsignal and Deepmod have been released for methylation calling [39]. However, those tools have been developed in Python and depend on a large number of bulky libraries, including TensorFlow. Thus, those tools are not very practical to optimise for embedded systems with limited resources. Conversely, Nanopolish, which utilises traditional alignment and Hidden Markov Model based methods and is developed in C/C++, was a good candidate to be optimised for embedded systems.

We overcome the above mentioned challenges by: (i) employing a custom heuristic-based memory allocation scheme; (ii) tailoring the algorithm and the GPU user-managed cache to exploit cache-friendly memory access patterns; and (iii) using a heuristic-based work-partitioning and load-balancing scheme between the CPU and GPU.

We demonstrate the utility of our GPU optimised ABEA by incorporating it into a completely re-engineered version of the popular methylation detection tool Nanopolish. First, we re-engineered the original Nanopolish methylation detection tool to efficiently utilise existing CPU resources, which we refer to as f5c. Then, we incorporated our GPU optimised ABEA algorithm into the re-engineered f5c. We demonstrate how f5c enables DNA methylation detection using nanopore sequencers in real-time (i.e. on-the-fly processing of the output) by using an embedded System on Chip (SoC) equipped with a GPU. We also demonstrate how f5c benefits a wide range of computing systems, from embedded systems and laptops to workstations and high performance servers.

The key contributions of this paper are: (i) the first example of GPU acceleration and optimisation of a raw signal alignment algorithm; (ii) f5c, a re-engineered and optimised version of the popular DNA methylation detection tool Nanopolish; and (iii) real-time detection of DNA methylation using a lightweight and portable embedded system (previously only possible on high-performance servers).

In the rest of the paper, we discuss the background of nanopore sequencing and the ABEA algorithm in Section 5.2, related work in Section 5.3, methodology in Section 5.4, and results in Section 5.5, followed by the discussion and future work in Section 5.6. The associated tool that includes the GPU-based acceleration is available at https://github.com/hasindu2008/f5c.

5.2. BACKGROUND
Basic terms and concepts of DNA sequencing and data analysis are given in Section 5.2.1. Section 5.2.2 briefly explains methylation calling, an example nanopore data analysis workflow. Section 5.2.3 explains the Adaptive Banded Event Alignment (ABEA) algorithm, the algorithm which is optimised in this paper for execution on a CPU-GPU heterogeneous architecture. Section 5.2.4 gives a brief account of GPU architectures and the programming methods for GPUs.
The genome is a long sequence composed of four types of nucleotide bases: adenine (A), cytosine (C), guanine (G) and thymine (T). Nucleotide bases will be referred to simply as bases hereafter. The human genome is around 3.2 gigabases (Gbases) long and is composed of 23 pairs of chromosomes (46 chromosomes in total), where each chromosome is a single molecule of continuous deoxyribonucleic acid (DNA) polymer. The process of reading strings of contiguous bases is called sequencing, and the resulting strings of bases are called reads. In order to be sequenced, DNA molecules must be extracted and purified from cells before being biochemically prepared for sequencing. This library preparation process can fragment chromosomes (especially large ones) into smaller segments, either intentionally or incidentally, which are 'read' by the sequencer. Given that samples contain multiple cells, and thus several distinct DNA molecules, and that sequencing may introduce errors, it is desirable to generate enough reads to cover a particular position several times. The average number of reads at a given position is termed the sequencing coverage. High coverage facilitates the characterisation of genetic variation and the correction of errors. A human genome sequenced at around 20× average coverage corresponds to around 64 Gbases of sequencing reads.

DNA undergoes naturally regulated biochemical modification through the addition of a methyl group to certain bases. Methylation is reversible and can control the activity of a DNA segment, such as turning the expression of genes on or off, without modifying the genetic code itself, a process called epigenetic regulation. DNA methylation is dynamically regulated during normal biological development and in response to environmental factors; it plays an important role in disease aetiology and clinical diagnostics [183–185].
Methylation of cytosine ('C') bases is of particular interest in human biology, where CpG dinucleotides (a 'C' base followed by a 'G' base) are dynamically methylated in normal development and disease [186–188].
Nanopore sequencing is a third generation sequencing technology that involves physical observation of atomic properties of DNA fragments using a nanometre scale biological pore coupled to an ammeter (see Fig. 5.1). The pore acts as a bottleneck to generate characteristic disruptions in ionic current (in the range of pico-amperes) that are indicative of the molecules passing through the pore. The size and nature of the pore influence the measured instantaneous current and how it is subsequently analysed. Oxford Nanopore Technologies (ONT) sequencing devices measure a DNA strand passing through biological nanopores composed of recombinant (or 'designer') proteins at an average speed of ~450 bases/s, while the current is sampled and digitised at ~4000 Hz (these are typical values at present and may vary in the future). The instantaneous current measured in an ONT nanopore depends on 5-6 contiguous bases [189]. The measured signal also presents stochastic noise due to several factors, such as homopolymers (the same base repeating multiple times) which produce constant current levels, contaminants in the sample, entanglement of long DNA strands, depletion of ions, etc. [190]. Additionally, the movement speed of the DNA strand through the pore can vary, causing the signal to warp in the time domain [190]. The raw signal is converted into character representations of DNA bases (e.g. A, C, G, T) using artificial neural networks, with a typical accuracy of >90% for single reads [55]. This conversion process is referred to as base-calling and the software tools that perform this conversion are referred to as base-callers. Please refer to [189] for a detailed discussion of ONT sequencing.

An example of a raw nanopore signal is shown in Fig. 5.2a using a blue coloured line. Assume that the signal is generated from the DNA sequence GAATACGAAAATCATTA which passed through the nanopore.
In this example, the instantaneous current of the signal is affected by a string of 6 contiguous bases, known as a 6-mer (or a k-mer in general). Let us assume that the annotation of the signal to the corresponding k-mers is known (the process of obtaining this annotation is detailed in Section 5.2.2). The 6-mers in the sequence and the corresponding segments in the raw signal are marked using vertical grey lines in Fig. 5.2a. When the DNA sequence GAATACGAAAATCATTA moves through the pore, the first 6-mer is GAATAC. Similarly, the subsequent 6-mers are AATACG, ATACGA, TACGAA, ..., TCATTA. The true annotation (depicted by the dotted green coloured step function in Fig. 5.2a) corresponds to the ideal average level of current for each k-mer. These ideal average values are obtained using the pore-model provided by ONT, which is elaborated in Section 5.2.2. The red coloured step function corresponds to an event, detailed in Section 5.2.2.

To deduce the sequence from the k-mers, the base at the centre (3rd base) of each k-mer is taken, as shown at the bottom of Fig. 5.2a. For instance, we take A from GAATAC, T from AATACG, C from TACGAA, etc. Hence, we obtain the sequence ATACGAAAATCA, which is a part of the original sequence GAATACGAAAATCATTA. Note that the beginning and the end of the sequence (GA at the beginning and TTA at the end) are clipped.
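The base deduction just described can be sketched in a few lines of Python (an illustrative sketch of the concept, not Nanopolish code; the function name is ours):

```python
def central_bases(seq, k=6):
    """Take the base at the centre (3rd base, index 2) of every
    k-mer of seq, mirroring the deduction shown in Fig. 5.2a."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return "".join(kmer[2] for kmer in kmers)

# The first 2 and last 3 bases of the original sequence are clipped.
print(central_bases("GAATACGAAAATCATTA"))  # ATACGAAAATCA
```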
Figure 5.2: Illustration of a nanopore raw signal, events and pore-model. (a) An example nanopore raw signal, its true annotation and detected events for the DNA sequence GAATACGAAAATCATTA. (b) An example pore-model: a table mapping each possible k-mer (AAAAAA, AAAAAC, ..., TTTTTT) to a mean (µ) and standard deviation (σ).
The length of the reads generated from nanopore sequencers can vary from several hundred bases to more than 2 million bases. A typical sequencing run of a particular sample (which completes after 48-64 hours) generates millions of such reads. The distribution of the read lengths varies as a function of DNA integrity, extraction protocols, and sample preparation methods. Example distributions for three different samples are shown in Fig. 5.3, where both the x and y axes are in logarithmic scale. The average read length of a sample typically falls between 8 and 20 kilobases.

Figure 5.3: Example nanopore read length distributions (read length vs read count, both on logarithmic scales, for ligation, rapid and small samples)
Once a nanopore read is base-called, the sequence is aligned to a reference sequence (see Fig. 5.1). A reference sequence consists of a previously generated consensus sequence (such as the human genome reference). Sequence alignment involves global optimisation algorithms to identify the most similar target and to compare any differences between sequences. Compared to biologically occurring variation in individual genomes (<1% difference to the reference), the error-rate of nanopore sequencing is relatively high (5-10%). Thus, sequence alignments derived from nanopore reads are distinct in nature from those of previous sequencing technologies (such as highly accurate short reads). Consequently, unique analytic tools must be considered when aligning such reads. Alignment tools such as Minimap2 [101] that employ a hash table based genome index followed by a base-level dynamic programming alignment step can successfully align long and noisy reads.
The base-space alignment discussed previously in Section 5.2.1.5 is followed by 'polishing', a downstream processing step which utilises both the base-space alignment results and the raw signal (see Fig. 5.1). The polishing step reuses the raw signal to recover biological information lost during base-calling. This polishing step can be used to correct errors made during base-calling or to detect modified nucleotide bases (e.g. DNA methylation).

Previous research has shown that identification of genetic variants can be improved up to an accuracy of more than 99% by using raw signal data from multiple overlapping reads [56,103]. Thus, downstream analysis that reuses raw signal data can correct for base-calling errors. It has also been shown that methylated C bases can be differentiated from non-methylated C bases by the use of signal data, using algorithms such as the one implemented in the software package Nanopolish [104]. Thus, downstream analysis that reuses raw signal data can also detect modified nucleotide bases.

Signal-space alignment is one of the crucial steps performed in these downstream analyses, such as error correction and modified base detection. This signal alignment step is described in the context of modified base detection in the following sections.
As discussed above, important biological information is lost during base-calling. Some base-calling models may not accommodate methylated data, either because they are trained on unmethylated sequences, or because they abstract away non-canonical bases. Therefore, these molecules may be erroneously classified as unmethylated bases. The process of identifying methylation is known as methylation calling.

As implemented in Nanopolish, methylation calling requires: (1) raw signals; (2) base-called reads; and (3) the base-space alignment to a reference genome (the output of the sequence alignment step described above). For a given read, the main steps for methylation calling are: (1) event detection; (2) signal-space alignment; and (3) Hidden Markov Model (HMM) profiling. These steps are performed for each individual read in the data set.
Event detection is the time series segmentation of the raw signal based on sudden signal level changes. Each segment is called an event and is typically denoted by the mean (µ), standard deviation (σ) and duration (n) of the raw signal samples pertaining to the particular segment. The red step function in Fig. 5.2a denotes such detected events by plotting the mean value of the samples (µ) corresponding to each segment. Note that in Fig. 5.2a the events (red line) roughly match the true annotation (dotted green line), but are not exactly the same. Mostly, the signal has been over-segmented (e.g. the portion corresponding to the k-mer CGAAAA has been segmented into 3 events) and seldom under-segmented (e.g. the k-mer AAATCA).

To obtain the true annotation in Fig. 5.2a, the events detected in the event detection step are aligned to a generic k-mer model signal. This generic k-mer model signal is derived from the base-called sequence and a pore-model provided by ONT. The pore-model corresponds to a table of all possible k-mers matched to their mean signal value and standard deviation (4^k k-mers, i.e. 4096 if k is 6, as shown in Fig. 5.2b). For each 6-mer in the base-called read, the corresponding entry in the pore-model (mean, sd) is obtained and these (mean, sd) pairs form the generic k-mer model signal. Nanopolish aligns the events from the event detection step to this generic k-mer model signal by using the algorithm named
Adaptive Banded Event Alignment (ABEA), explained in Section 5.2.3.

ABEA produces the alignment between the events and the k-mers in the base-called read. The base-space sequence alignment is then used to deduce which event corresponds to a given k-mer in the reference genome. Finally, this alignment between the events and the k-mers in the reference genome is subjected to Hidden Markov Model (HMM) profiling to identify whether a given base is methylated or not.
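The construction of the generic k-mer model signal described above can be sketched as follows (a minimal Python illustration; the pore-model values here are made up, and a real pore-model tabulates all 4^6 6-mers):

```python
# Toy pore-model: k-mer -> (mean, sd), as in Fig. 5.2b.
# The numeric values are purely illustrative.
pore_model = {
    "GAATAC": (85.0, 2.1),
    "AATACG": (99.5, 1.8),
    "ATACGA": (78.2, 2.5),
}

def generic_model_signal(read, k=6):
    """Look up the (mean, sd) pore-model entry for each k-mer of
    the base-called read; events are aligned against this signal."""
    return [pore_model[read[i:i + k]] for i in range(len(read) - k + 1)]

print(generic_model_signal("GAATACGA"))
# [(85.0, 2.1), (99.5, 1.8), (78.2, 2.5)]
```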
Modified versions of the Suzuki-Kasahara (SK) algorithm (explained in Section 2.2.2) are used for event-space alignment, as exemplified in Nanopolish, and are referred to as Adaptive Banded Event Alignment (ABEA). In ABEA, the events are aligned to the k-mers of the base-called read (as stated in Section 5.2.2). As there are typically many more events than k-mers (usually by a factor of 1.5-2) due to the frequent over-segmentation of events (discussed in Section 5.2.2), event alignment is even more difficult than base-space long read alignment if performed with static banding around the diagonal. Thus, an adaptive band is essential for event alignment. (Note that a pore-model can contain other values in addition to the mean and standard deviation, but these are not required for our methylation calling.)

The scoring function for signal alignment uses a 32-bit floating point data type, as opposed to the 8-bit integer data type in sequence alignment. Furthermore, the signal alignment scoring function that computes the log-likelihood (which we elaborate shortly) is computationally expensive.

A simplified example of ABEA is shown in Fig. 5.4a. In Fig. 5.4a the horizontal axis represents the events (results of the event detection step) and the vertical axis represents the ref k-mers (k-mers of the base-called read). The dynamic programming table (DP table) in Fig. 5.4a is for 13 events (indexed e0 to e12) and the ref k-mers (indexed from k0). As mentioned previously, for computational and memory efficiency, only the diagonal bands (marked using blue rectangles) with a band-width of W (typically W=100 for nanopore signals) are computed. The bands are computed along the diagonal from the top-left (b0) to the bottom-right (b17). Each cell score is computed as a function of five factors: the scores from the three neighbouring cells (up, left and diagonal); the corresponding ref k-mer; and the event (shown for an example cell via red arrows in Fig. 5.4b; details of the computation are explained later).
Observe that all the cells in the nth band can be computed in parallel, as long as the (n−1)th and (n−2)th bands have been computed beforehand. To contain the optimal alignment, the band adapts by moving down or to the right, as shown using blue arrows in Fig. 5.4a. The adaptive band movement is determined by the Suzuki-Kasahara heuristic rule [102].

Algorithm 8 summarises the ABEA algorithm used in Nanopolish [104] and is explained with the aid of the example in Fig. 5.4a. The inputs to Algorithm 8 are: (1) ref (the sequenced read in base-space, e.g. GAATACG...); (2) events (the output of the event detection step mentioned in Section 5.2.2); and (3) model (the pore-model, Fig. 5.2b). As mentioned in Section 5.2.2, the ABEA algorithm (Algorithm 8) attempts to align the events to the generic signal model (produced with the use of ref and model) and outputs the alignment as event-ref pairs. The algorithm requires three intermediate arrays, namely score (a 2D floating point array), trace (a 2D byte array) and ll (a 1D pointer array), to formulate the intermediate state during alignment computation, which is the DP table shown in Fig. 5.4a. Note that ll stands for lower-left, which holds the coordinate of the start point of the band.

The initialisation of the first two bands (b0 and b1) in Fig. 5.4a is performed by line 20 of Algorithm 8. Then, the outer loop (starting from line 3) iterates through the rest of the bands from the top-left to the bottom-right of the DP table. The inner loop (lines 11-15) iterates through each cell in the current band bi. To ensure that only cells within the DP table are computed, the loop counter j iterates from min_j to max_j, instead of 0 to W−1. Lines 4-9 of Algorithm 8 correspond to the movement of the band (the blue arrows in Fig. 5.4a). Band movement is actuated by proper placement of the band in the static 2D arrays score and trace via the array ll, using the functions move_band_right and move_band_down.

Line 12 of the algorithm performs the cell score computation (explained in detail later) and generates a score and a direction flag for subsequent backtracking, which are then stored in the arrays score and trace. When all the cells in the DP table are computed, the final operation is to find the actual alignment (event-ref pairs) through the backtracking operation (line 17 of Algorithm 8 and the red trace-back arrows in Fig. 5.4c), which uses both the cell scores and the direction flags stored in trace.

The compute function (called at line 12 of Algorithm 8) is elaborated in Algorithm 9. A number of heuristically determined constants suitable for nanopore data, which are used during subsequent calculations, are listed at the beginning of this algorithm. The first step of this algorithm is the computation of lp_emission, a log probability value (the likelihood of the particular signal event being the particular ref k-mer), performed using the function elaborated in Algorithm 10. This computed lp_emission is used in lines 4-5 of Algorithm 9, along with the heuristically determined constants (lp_skip, lp_stay, lp_step), to compute three scores from the diagonal, left and up directions (score_d, score_u, score_l). The maximum of the three scores and the direction from which the maximum score came (a flag pertaining to diagonal, up or left) are returned as the outputs of this function. Line 3 of Algorithm 9 refers to accessing the scores of the upward, left and diagonal cells, which was previously mentioned with respect to the example cell and the red arrows in Fig. 5.4b.

The log probability computation in Algorithm 10 involves floating point log probability computations. For the k-mer at the specific ref position, the pore-model table (Fig. 5.2b) is accessed to obtain the corresponding model values. This model_kmer (the mean and standard deviation of the particular model k-mer) and the mean value of the event are used for the log probability computation, as shown in Algorithm 10. Note that for event alignment, neither the standard deviation nor the duration of the event is used.

The above elaboration covers the ABEA algorithm to a level sufficient to explain our GPU implementation and optimisations. Therefore, implementation details of checking out-of-bound array accesses and the backtracking process were not discussed. Furthermore, the concepts of the 'trim state' and 'event scaling' were not discussed, as the control flow of the algorithm is not affected by them. Thus, those details are not vital for the elaboration of the GPU implementation. However, for the sake of completeness, a brief account of the 'trim state' and 'read-model scaling' is given below.

The raw signal may contain samples at the beginning/end that may be ignored by the base-caller and hence do not contribute to the base-called sequence. These samples may be open pore signal immediately before or after the DNA molecule is detected (i.e. the electric current when nothing is in the nanopore), or perhaps part of the adaptor (molecules bound to the ends of the DNA molecules to enable sequencing). The 'trim states' allow the alignment to ignore these samples, since such samples should not be considered to be part of the base-called read.

Due to reasons such as slight variations between different nanopores and characteristic changes of the same nanopore over time, an event will not directly match the pore-model in Fig. 5.2b [191].
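The emission term computed in Algorithm 10 is, in essence, the log-density of an event's mean under the Gaussian defined by the pore-model entry. A sketch of this idea (our own simplified formulation for illustration, not the exact Nanopolish code):

```python
import math

def lp_emission(event_mean, model_mean, model_sd):
    """Log-likelihood of an event mean under N(model_mean, model_sd^2),
    the Gaussian given by the pore-model entry for the ref k-mer."""
    z = (event_mean - model_mean) / model_sd
    return -math.log(model_sd * math.sqrt(2.0 * math.pi)) - 0.5 * z * z

# An event close to the model mean scores higher than a distant one.
print(lp_emission(100.0, 100.0, 2.0) > lp_emission(110.0, 100.0, 2.0))  # True
```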
Therefore, to account for these variations, either the events or the pore-model must be scaled. In Nanopolish, two scaling parameters, namely shift and scale, are estimated on a per-read basis, prior to the ABEA algorithm, using a 'Method of Moments' approach [191]. Then, during ABEA, the pore-model mean values are scaled using these two parameters. The scaling should be performed at line 5 of Algorithm 10 as µ ← model_kmer.mean × scale + shift, instead of directly assigning model_kmer.mean to µ.

Graphics Processing Units (GPUs) were originally designed as co-processors for graphics processing and rendering. Graphics processing and rendering algorithms involve pixel-wise operations which expose fine-grained parallelism; thus GPUs consist of hundreds of compute cores to perform parallel processing. Eventually, the concept of general purpose graphics processing units (GPGPU) emerged, where GPUs were exploited to accelerate compute intensive, yet highly parallel, portions of general purpose algorithms. GPUs are quite popular in scientific computation due to the significant speedup when used for common matrix manipulations which contain fine-grained parallelism. For around a decade, GPUs explicitly designed for high performance computers have been available (e.g., Tesla GPUs from NVIDIA).

GPUs are of Single Instruction Multiple Data (SIMD) architecture (or more accurately Single Instruction Multiple Thread, as stated by NVIDIA), where multiple threads run the same stream of instructions in parallel, yet on different data. Conversely, CPUs are of Multiple Instruction Multiple Data (MIMD) architecture, where each thread runs its own instruction sequence on its own data stream, independent of the others. GPUs have hundreds or even thousands of processing cores, while a CPU would maximally have a few dozen cores. However, GPU cores are relatively less complex (fewer instructions, smaller caches, no sophisticated branch prediction units, etc.)
and run at a lower clock speed when compared to a CPU. Due to these significant differences between CPU and GPU architectures, serial algorithms designed and developed for CPUs are not suitable for execution on GPUs. Such algorithms have to be adapted and parallelised in a way that efficiently uses the GPU architectural features.

NVIDIA provides a programming model/framework, called Compute Unified Device Architecture (CUDA), for programming their GPUs for general purpose computations. CUDA includes CUDA C/C++ (extended C/C++ syntax) and an Application Programming Interface (API) that provide a platform to write programs for NVIDIA GPUs. We used CUDA C/C++ for our GPU implementation of the Adaptive Banded Event Alignment algorithm.

We now briefly introduce GPU/CUDA related terms. Readers are advised to refer to [192] and [193] for further information.

A GPU kernel is a function that is executed on a GPU. A GPU kernel is written from the execution perspective of a single GPU thread. GPU kernels run in parallel, based on the parameters specified with the function call, known as the thread configuration. This thread configuration in CUDA is an abstraction which employs a hierarchy of threads. In the thread hierarchy, a group of threads is known as a block. A group of blocks forms a grid. Instances of a single kernel are executed in a single grid. Blocks and grids can be one, two or three dimensional. The presence of this thread hierarchy lets the programmer organise and map the threads conveniently to a grid. These logical threads are mapped to the hardware cores automatically by the underlying driver software and hardware.

A thread block consists of one or more thread warps. A warp is a group of threads sharing the same program counter.
A data dependent conditional branch inside a warp causes the threads to execute each code path while disabling the threads that are not in the path, a phenomenon known as warp divergence. Warp divergence affects performance and should be minimised. The occupancy is the percentage of active warps out of the maximally supported warps on the GPU. A lower occupancy leads to under-utilisation of GPU resources; thus, a higher occupancy is preferable for better utilisation of GPU resources.

GPUs also employ a memory hierarchy. The relatively large but slow Dynamic Random Access Memory (DRAM) that forms the lowest level in the memory hierarchy is known as global memory. Global memory is typically allocated using the cudaMalloc()
API function. Memory allocated in this global memory can be accessed by all the threads in the grid. The next level in the memory hierarchy, which is made of relatively fast yet smaller SRAM, is called shared memory. Shared memory is allocated on a per-thread-block basis and is shared by all the threads in the block. Shared memory can be called a user managed cache (more accurately, a programmer managed cache), as the programmer is expected to identify and load frequently accessed data into the shared memory. In addition, there are one or more levels of SRAM caches managed by the hardware. The registers are the fastest and highest in the hierarchy and are allocated by the compiler on a per-thread basis.

The global memory can easily be saturated when hundreds of threads compete to access the memory at the same time. Thus, memory accesses should be batched such that contiguous threads access contiguous memory locations. This process is referred to as memory coalescing and reduces the number of global memory requests, thus reducing the impact on performance compared to scattered memory accesses. Additionally, the programmer can utilise the shared memory to load and store frequently accessed data, which also reduces global memory traffic.
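The benefit of coalescing can be illustrated with a simplified model that counts how many aligned memory segments one warp's accesses span (a rough sketch; real GPUs have more nuanced transaction rules, and the 32-byte segment size is an assumption for illustration):

```python
def segments_touched(byte_addresses, seg_size=32):
    """Number of aligned memory segments covered by a warp's
    accesses; fewer segments means fewer memory transactions."""
    return len({addr // seg_size for addr in byte_addresses})

warp = range(32)                          # 32 threads in one warp
coalesced = [4 * t for t in warp]         # thread t reads adjacent 4-byte words
strided = [4 * 100 * t for t in warp]     # thread t reads with a 100-word stride

print(segments_touched(coalesced))  # 4  (whole warp served by 4 segments)
print(segments_touched(strided))    # 32 (one segment per thread)
```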
An algorithm to call methylation using the raw signal from ONT sequencers was introduced by Simpson et al. [104]. The associated C++ based implementation of this algorithm is a sub-module of the open source tool
Nanopolish. Nanopolish was designed to run on high-performance computers and is not lightweight or suitable for deployment on embedded systems.

The signal-space alignment algorithm, termed
Adaptive Banded Event Alignment (ABEA), used in
Nanopolish, is a customised version of the Suzuki-Kasahara alignment algorithm [102] for base-level sequence alignment. To the best of our knowledge, neither of these algorithms (ABEA or Suzuki-Kasahara) has a GPU accelerated version. These algorithms have their roots in dynamic programming sequence alignment algorithms, such as Smith-Waterman and Needleman-Wunsch. A number of GPU accelerated versions of Smith-Waterman exist in previous research [134, 194, 195]. However, the Smith-Waterman algorithm has a compute complexity of O(n²) and is most practical when the sequences are short, especially when millions of sequences need to be aligned. As nanopore sequencers can produce reads >1 million bases long, computing the full DP table for such reads using Smith-Waterman would require >10¹² computations and hundreds of gigabytes of RAM, and even more if aligning raw nanopore signals.

Heuristic approaches such as banded Smith-Waterman attempt to reduce the search space by limiting computation along the diagonal of the DP table. While this approach is suitable for Illumina short reads, it is less so for noisy long nanopore reads, as a substantial band-width is required to contain the alignment within the band. The Suzuki-Kasahara algorithm uses a heuristic that allows the band to adapt and move during the alignment, thus containing the optimal alignment within the band while allowing large gaps in the alignment. Modified versions of the adaptive banded alignment algorithm are used for signal-space alignment, as exemplified in Nanopolish. The band-width (width of the band) used for signal-space alignment is typically higher (~100) than in other banded algorithms used for sequence alignment. In addition, the scoring function for signal alignment uses a 32-bit floating point data type, as opposed to 8-bit integers in sequence alignment. Furthermore, the signal alignment scoring function that computes the log-likelihood is computationally expensive.
Taken together, these reasons motivated us to consider using GPUs to speed up the computation of signal-space alignment.

The portable compute module MinIT, manufactured by ONT, is composed of an NVIDIA SoC [196] that exploits GPUs for performing live base-calling; it can perform base-calling at a speed of ~150 Kbases per second, thus keeping up with the MinION sequencer's output. In addition, our previous work has optimised the popular
Minimap2 [101] sequence alignment tool (which typically requires ~16GB of memory) for reduced peak memory usage, enabling the software to be executed on embedded processors [25]. The data processing steps required for methylation calling can thus run on embedded processors, supporting the implementation of a portable, offline DNA methylation detection application that would facilitate such analyses in the field.

Load balancing between the CPU and GPU for heterogeneous processing has been explored in areas such as fluid dynamics [197] and the conjugate gradient method [198]. However, nanopore data have different characteristics compared to the aforementioned applications, which are predominantly based on matrices. Furthermore, the signal-space alignment algorithm is different from the linear algebra algorithms used in these fields. We exploit characteristics of nanopore data and algorithms to perform memory, compute and load balancing optimisations.
To optimise performance on GPUs, we process a batch of reads at a time (the original source code processes one read at a time). Such batch processing minimises the data transfer initialisation overhead (between RAM and GPU memory); reduces the GPU kernel invocation overhead; and allows parallelism which sufficiently occupies all available GPU cores. The execution flow follows the typical GPU programming paradigm, which is elaborated in Algorithm 11. In Algorithm 11, gpu_alignment(...) refers to the GPU implementation of the Adaptive Banded Event Alignment (the CPU algorithm is elaborated in Algorithm 8). We present our methodology in three steps: parallelisation and compute optimisations in Section 5.4.1; memory optimisation in Section 5.4.2; and resource optimisation through heterogeneous processing in Section 5.4.3.
The GPU implementation of the Adaptive Banded Event Alignment (ABEA) algorithm is broken into three GPU kernels. Breaking it down into three GPU kernels allows for efficient thread assignment based on the workload type, synchronisation of all GPU threads (a GPU kernel execution is inherently a synchronisation barrier [192]) and minimal warp divergence, compared to a big all-in-one GPU kernel. The three GPU kernels are:

• pre-kernel - initialises the first two bands of the dynamic programming table (corresponds to line 2 of Algorithm 8) and pre-computes values that are frequently accessed by the next GPU kernel;

• core-kernel - fills the dynamic programming table, which is the compute intensive portion of the ABEA algorithm (corresponds to lines 3-16 of Algorithm 8, composed of the nested loop); and,

• post-kernel - performs backtracking (corresponds to line 17 of Algorithm 8)
The pre-kernel initialises the first two bands of the dynamic programming table (the initialisation performed at line 2 of Algorithm 8 on the CPU). The pre-kernel also pre-computes the values in a data structure called kcache, a newly introduced data structure in the GPU implementation that improves cache hits during the subsequent execution of the core-kernel.

A simplified version of the pre-kernel is given in Algorithm 12 and the thread configuration for the invocation of the pre-kernel is shown in Fig. 5.5. Note that the GPU kernel is presented (as is always the case) from the perspective of a single GPU thread in Fig. 5.5.

Each cell in Fig. 5.5 represents a GPU thread denoted as t, where the subscripts x and y denote the thread index along the x-axis and the y-axis, respectively. The thread grid in Fig. 5.5 is composed of n thread blocks, where n is the number of reads in the batch. Each thread block contains WX threads, where WX is the nearest multiple of 32 greater than or equal to the band-width W (the band-width of the ABEA algorithm); i.e. WX = ((int)((W + 31)/32)) × 32. For instance, if W=100, WX is 128. The reason for taking a multiple of 32 is the performance attributed to a thread block size that is a multiple of the warp size (the warp size is currently 32) [193]. As shown in Fig. 5.5, a single thread block composed of WX threads is assigned to a single read.

In Algorithm 12, lines 2-3 get the thread index of the thread being executed, i.e. the thread indices denoted as x and y in Fig. 5.5. Line 4 obtains the memory pointers of the input array ref; the intermediate arrays score and trace; and kcache, the use of which is explained in the memory optimisation section (Section 5.4.2).

Lines 5-8 of Algorithm 12 initialise the first two bands of the dynamic programming table (which was performed at line 2 of the original CPU Algorithm 8). The kernel is written from the perspective of a single thread and thus a single cell is initialised by a single thread. The collective execution of all the threads in Fig. 5.5 effectively sets a band for all the reads in the batch in parallel, which is illustrated in Fig. 5.6. Note that only the first two reads are elaborated in Fig. 5.6, and in reality each thread block has a read assigned to it. In Fig. 5.6, each cell in the first band (marked as iteration 1) contains the index of the thread which performs the initialisation at line 6 of Algorithm 12. Similarly, iteration 2 corresponds to line 7 of Algorithm 12.

The if condition on line 5 of Algorithm 12 limits the threads to the width of the band W, a consequence of selecting WX as a multiple of 32 (as stated previously). Note that there is a 1024 thread limit for a block [192] in the current NVIDIA CUDA/GPU architecture; thus our implementation will only work for a maximum band-width of 1024. This limit is more than sufficient for a typical W of 100 in ABEA.

Lines 10-11 of Algorithm 12 initialise the index of the lower left of the band, which corresponds to lines 23-24 of Algorithm 8.
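The rounding of W to WX above is a small but easy-to-get-wrong integer calculation, sketched here in Python with integer division:

```python
def round_to_warp_multiple(w, warp_size=32):
    """Nearest multiple of the warp size at or above w, i.e.
    WX = ((int)((W + 31) / 32)) * 32 for a warp size of 32."""
    return ((w + warp_size - 1) // warp_size) * warp_size

print(round_to_warp_multiple(100))  # 128, as in the W=100 example above
```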
Note that this initialisation is executed by one thread per read (thread id 0 along the y-axis). Lines 13-16 in Algorithm 12 initialise kcache. As stated previously, kcache is a newly introduced array for the GPU implementation to minimise random accesses to the GPU memory during the core-kernel and will be explained in Section 5.4.1.2. Note that this kcache initialisation in lines 13-16 is also executed by one thread per read (thread id 0 along the y-axis). The loop in lines 13-16 can be further parallelised; however, as the time spent on the pre-kernel is comparatively negligible (see results), further parallelising this loop is superfluous.

A simplified version of the core-kernel, which fills the dynamic programming table in Fig. 5.4a (corresponds to lines 3-16 of the original Algorithm 8), is given in Algorithm 13. This kernel is executed with the same kernel thread configuration as the pre-kernel in Fig. 5.5. Thus, a batch of reads is processed in parallel with a block of threads assigned to a single read, in a similar way to the pre-kernel (Fig. 5.6). The only difference in Fig. 5.6 for the core-kernel is that the third band to the last band are processed instead of the first two bands.

All the W cells in a given band (Fig. 5.4a) are computed by W GPU threads in parallel (lines 26-30 of Algorithm 13); thus the inner loop of Algorithm 8 (lines 11-15) is no longer present. However, the outer loop of Algorithm 8 cannot be parallelised due to band n depending on bands n−1 and n−2. Synchronisation barriers (__syncthreads) in Algorithm 13 prevent any data hazards due to multiple threads being assigned to a single read.

Another notable difference in the GPU implementation is the use of GPU shared memory [192] (a user-managed cache, or more accurately a programmer-managed cache) for exploiting the temporal locality in the memory accesses to the dynamic programming table (the nth band in Fig. 5.4a is computed using bands n−1 and n−2).
Shared memory is allocated for three bands (the current, previous and second-previous bands) at lines 6-7 of Algorithm 13, and these are then initialised at lines 9-10 of Algorithm 13. These initialised memory locations are used during the band-direction computation (lines 14-21 of Algorithm 13) and the cell-score computation (lines 27-28 of Algorithm 13), eliminating any accesses to the slow GPU global memory (shared memory is SRAM, whereas global memory is DRAM). The cell score is written to the global memory at the end of the iteration (line 32 of Algorithm 13), as the scores are later required for backtracking. Finally, the current, previous and second-previous bands are set for the next iteration (lines 33-36 of Algorithm 13).

As stated under Section 5.4.1.1, the data structure kcache introduced to the GPU implementation facilitates memory coalescing by minimising random memory accesses to the model array (the pore-model array in Fig. 5.2b). If kcache did not exist, the access pattern of contiguous threads in the core-kernel (shown for iteration 5 of read 0) would be as in Fig. 5.7a, where the accesses to ref are shown as green arrows and the subsequent accesses to the pore-model as red arrows. The green arrows (relating to getting the k-mer at line 2 of Algorithm 10 in the CPU version) are spatially local and would facilitate memory coalescing on the GPU. However, the red arrows (relating to line 4 of Algorithm 10 in the CPU version) into the model array are random accesses. Note that such random accesses would occur during every iteration (from iteration 3 to the last band iteration). Multiple threads accessing random GPU memory locations degrade performance due to the smaller and less powerful GPU caches (compared to the CPU); for instance, the 32 KB pore-model array is larger than the 8 KB GPU constant cache [192].

These random accesses are eliminated by the kcache constructed in the pre-kernel (stated under Section 5.4.1.1), which is then passed as an argument to the compute function at line 27 of Algorithm 13.
This kcache is then passed on to the log_probability_match function (at line 2 of Algorithm 14), which uses it at line 4 of Algorithm 15. The construction of the kcache in the pre-kernel requires random accesses to the model, as shown in Fig. 5.7b, but this happens only once. In contrast, the kcache is utilised by the core-kernel in every iteration and facilitates memory coalescing (see the green arrows in Fig. 5.7c, which are spatially local accesses to the kcache by contiguous threads in iteration 5).

It is noteworthy that allocating one thread block per read is critical (in the kernel configuration) to: use lightweight block synchronisation primitives (__syncthreads) instead of expensive kernel invocations as synchronisation barriers [192]; minimise warp divergence (otherwise, the longest read in the thread block would consume the longest time in the band-filling loop); and use shared memory per read (shared memory is allocated per block).

The backtracking operation performed by the post-kernel (one thread assigned to one read) does not expose fine-grained parallelism as the previous kernels do, and is thus not ideal for the GPU. However, performing backtracking on the GPU is still advantageous compared to transferring the huge intermediate arrays (score and trace, with sizes in the order of gigabytes) from the GPU to the RAM. In addition, no additional memory in the RAM is required, thus reducing the peak RAM usage.

Allocating one thread block per read (as in the core-kernel, to reduce warp divergence) is not ideal for this post-kernel, due to the lack of fine-grained parallelism (i.e. one block having one thread), which results in reduced GPU occupancy (occupancy would be limited by the maximum number of thread blocks that can simultaneously reside in a GPU multi-processor).
This is remedied, without affecting warp divergence, by allocating a large number of threads per block (e.g. 1024) and then limiting only the first thread in each warp (a warp is composed of 32 contiguous threads [192], thus the threads with indices 0, 32, 64, 96, etc.) to perform the actual computation (the backtracking for a read).
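Two small pieces of the kernel-configuration arithmetic described above can be made concrete: rounding the band-width W up to the block width WX (a multiple of the warp size), and selecting one warp-leader thread per warp for the post-kernel backtracking. A minimal language-agnostic sketch (the function names are illustrative, not from the implementation):

```python
WARP_SIZE = 32  # warp size on current NVIDIA GPUs [192]

def block_width(w: int) -> int:
    """WX: the band-width W rounded up to the nearest multiple of the warp size."""
    return ((w + WARP_SIZE - 1) // WARP_SIZE) * WARP_SIZE

def is_warp_leader(thread_idx: int) -> bool:
    """In the post-kernel, only the first thread of each warp
    (indices 0, 32, 64, ...) performs the backtracking for a read."""
    return thread_idx % WARP_SIZE == 0
```

For the typical W of 100, `block_width(100)` gives 128, matching the WX quoted in the text; a 1024-thread block then contains 32 warp leaders, each backtracking one read.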
Algorithm 8
Adaptive Banded Event Alignment
Input:
  ref[]       : the base-called read (1D char array)
  model       : pore-model (Fig. 5.2b)
  events[]    : event table containing {µ̄_x, σ̄_x, n̄_x} of each event — 1D {float, float, float} array
Output:
  alignment[] : the alignment denoted by a list of {event index, k-mer index} — 1D {int, int} array
Intermediate:
  score[][]   : scores of the cells in the banded area — 2D float array
  trace[][]   : back-track flags of the cells in the banded area — 2D char array
  ll_idx[]    : {event index, k-mer index} of each band's lower-left cell — 1D {int, int} array

 1: function align(ref, model, events)
 2:   initialise_first_two_bands(score, trace, ll_idx)   ▷ bands b0 and b1 in Fig. 5.4a, see line 20
 3:   for i ← 2 to n_bands−1 do                          ▷ iterate from b2 to b17 in Fig. 5.4a
 4:     dir ← suzuki_kasahara_rule(score[i-1])           ▷ score[i-1] is of the previous band
 5:     if dir == right then
 6:       ll_idx[i] ← move_band_to_right(ll_idx[i-1])    ▷ see line 28
 7:     else
 8:       ll_idx[i] ← move_band_down(ll_idx[i-1])        ▷ see line 33
 9:     end if
10:     min_j, max_j ← get_limits_in_band(ll_idx[i])     ▷ get index bounds in the current band *
11:     for j ← min_j to max_j do                        ▷ iterate through each cell in band i
12:       s, d ← compute(score[i-1], score[i-2], ref, events, model)   ▷ see Algorithm 9
13:       score[i,j] ← s
14:       trace[i,j] ← d
15:     end for
16:   end for
17:   alignment ← backtrack(score, trace, ll_idx)        ▷ the trace-back red arrows in Fig. 5.4c
18: end function

20: function initialise_first_two_bands(score, trace, ll_idx)
21:   score[0,*], trace[0,*] ← −∞, 0                     ▷ initialise the first band b0
22:   score[1,*], trace[1,*] ← −∞, 0                     ▷ initialise the second band b1
23:   ll_idx[0] ← {ei, ki}                               ▷ ei = 1 and ki = −1 **
24:   ll_idx[1] ← {ei, ki}                               ▷ ei = 1 and ki = 0 in Fig. 5.4a **
25:   score[0, si] ← 0                                   ▷ si is 0 in Fig. 5.4a ***
26: end function

28: function move_band_to_right(ll_previous)
29:   ll_current.event_idx ← ll_previous.event_idx + 1
30:   ll_current.kmer_idx ← ll_previous.kmer_idx
31: end function

33: function move_band_down(ll_previous)
34:   ll_current.event_idx ← ll_previous.event_idx
35:   ll_current.kmer_idx ← ll_previous.kmer_idx + 1
36: end function

* For instance, in Fig. 5.4a: min_j=1, max_j=1 for b0 and b17; min_j=0, max_j=1 for b1; min_j=1, max_j=2 for b16; and min_j=0, max_j=2 for the rest.
** These initial event and k-mer indices, corresponding to the lower left of the band, are computed with respect to the band-width W.
*** The score of the cell that corresponds to k-mer index −1 in band b0 is initialised to 0.

Figure 5.4: Adaptive Banded Event Alignment — (a) band movement; (b) computing a single cell score; (c) trace-back.
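The two band-movement helpers of Algorithm 8 (lines 28 and 33) each amount to incrementing one component of the band's lower-left cell index. A minimal sketch, holding the index as an (event_idx, kmer_idx) tuple rather than a struct:

```python
def move_band_to_right(ll_previous):
    """Moving the band right advances one event (line 28 of Algorithm 8)."""
    event_idx, kmer_idx = ll_previous
    return (event_idx + 1, kmer_idx)

def move_band_down(ll_previous):
    """Moving the band down advances one k-mer (line 33 of Algorithm 8)."""
    event_idx, kmer_idx = ll_previous
    return (event_idx, kmer_idx + 1)
```

For example, moving down from the initial lower-left index (1, −1) of band b0 yields (1, 0), the lower-left index of band b1 in Fig. 5.4a.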
Algorithm 9
Adaptive Banded Event Alignment - cell score computation
Constants:
  events_per_kmer = n_events / n_kmers
  ε = 1×10⁻¹⁰
  lp_skip = ln(ε)
  lp_stay = ln(1 − 1/(events_per_kmer + 1))
  lp_step = ln(1.0 − e^{lp_skip} − e^{lp_stay})

function computation(score_prev, score_2ndprev, ref, events, model)
  lp_emission ← log_probability_match(ref, events, model)   ▷ see Algorithm 10
  up, diag, left ← get_scores(score_prev, score_2ndprev)    ▷ see red arrows in Fig. 5.4b
  score_d ← diag + lp_step + lp_emission
  score_u ← up + lp_stay + lp_emission
  score_l ← left + lp_skip
  s ← max(score_d, score_u, score_l)
  d ← direction from which the max score came
end function

Algorithm 10 Adaptive Banded Event Alignment - log probability computation

function log_probability_match(ref, events, model)
  event, kmer ← get_event_and_kmer(ref, events)   ▷ see red arrows in Fig. 5.4b
  x ← event.mean
  model_kmer ← get_entry_from_poremodel(kmer, model)
  µ ← model_kmer.mean
  σ ← model_kmer.stdv
  z ← (x − µ) / σ
  lp_emission ← −ln(√(2π)) − ln(σ) − 0.5·z²
end function

Algorithm 11
Outline of execution flow

for each batch of n reads do
  ...                      ▷ CPU processing steps before the Adaptive Banded Event Alignment, e.g. event detection
  memcpy_ram_to_gpu(...)   ▷ copy the inputs of the Adaptive Banded Event Alignment to the GPU memory
  gpu_alignment(...)       ▷ perform the event alignment on the GPU
  memcpy_gpu_to_ram(...)   ▷ copy the results back to the RAM
  ...                      ▷ CPU processing steps after the alignment, e.g. HMM
end for

Figure 5.5: Thread configuration of the pre-kernel — n thread blocks (one per read in the batch), each WX threads wide (band-width W rounded to the nearest upper multiple of 32); thread t_{x,y} has index x along the band and block/read index y.

Figure 5.6: Thread assignment of the pre-kernel. The assignment for the first two reads is shown; each thread block has a read assigned to it (block 0 comprises threads t_{x=0,y=0} to t_{x=WX−1,y=0}, and read 0 is processed by all threads in block 0; similarly, block 1 comprises t_{x=0,y=1} to t_{x=WX−1,y=1}, and read 1 is processed by the threads in block 1).

Algorithm 12
Adaptive Banded Event Alignment - pre-kernel

 1: function align_pre(..., model)        ▷ "..." refers to other arguments, explained later in Section 5.4.2
 2:   j ← thread index along x            ▷ the x subscript of a thread in Fig. 5.5
 3:   i ← thread index along y            ▷ the y subscript of a thread in Fig. 5.5
 4:   (ref, score, trace, ll_idx, kcache) ← get_cuda_pointers(i, ...)   ▷ memory pointers of the arrays corresponding to read i (explained in Section 5.4.2)
 5:   if j < W then                       ▷ though a block is WX wide (Fig. 5.5), only W threads should execute
 6:     score[0,j], trace[0,j] ← −∞, 0    ▷ corresponds to line 21 of Algorithm 8
 7:     score[1,j], trace[1,j] ← −∞, 0    ▷ corresponds to line 22 of Algorithm 8
 8:   end if
 9:   if j == 0 then                      ▷ only thread 0 processes this section
10:     ll_idx[0] ← {ei, ki}              ▷ corresponds to line 23 of Algorithm 8
11:     ll_idx[1] ← {ei, ki}              ▷ corresponds to line 24 of Algorithm 8
12:     score[0, si] ← 0                  ▷ corresponds to line 25 of Algorithm 8
13:     for k ← 0 to numkmers do          ▷ iterate through each k-mer in ref from left to right
14:       kmer ← get_kmer_at(ref, k)      ▷ k-mer at position k in ref
15:       kcache[k] ← get_entry_from_poremodel(kmer, model)
16:     end for
17:   end if
18: end function

Algorithm 13
Adaptive Banded Event Alignment - core-kernel

 1: function align_kernel_core(...)       ▷ "..." refers to the arguments, explained later in Section 5.4.2
 2:   j ← thread index along x            ▷ the x subscript of a thread in Fig. 5.5
 3:   i ← thread index along y            ▷ the y subscript of a thread in Fig. 5.5
 4:   (events, score, trace, ll_idx, kcache) ← get_cuda_pointers(i, ...)   ▷ memory pointers of the arrays corresponding to read i (explained in Section 5.4.2)
 5:   n_bands ← n_events + read_len
 6:   __shared__ c_score[W], p_score[W], pp_score[W]   ▷ allocate space in fast shared memory for the scores of the current, previous and 2nd-previous bands
 7:   __shared__ c_ll_idx, p_ll_idx, pp_ll_idx         ▷ allocate space in fast shared memory for the indexes of the lower-left cells of the current, previous and 2nd-previous bands
 8:   if (j …

Algorithm 14 Adaptive Banded Event Alignment - core-kernel - cell score computation

Constants:
  events_per_kmer = n_events / n_kmers
  ε = 1×10⁻¹⁰
  lp_skip = ln(ε)
  lp_stay = ln(1 − 1/(events_per_kmer + 1))
  lp_step = ln(1.0 − e^{lp_skip} − e^{lp_stay})

function computation(score_prev, score_2ndprev, kcache, events)
  lp_emission ← log_probability_match(kcache, events)   ▷ see Algorithm 15
  up, diag, left ← get_scores(score_prev, score_2ndprev)   ▷ see red arrows in Fig. 5.4b
  score_d ← diag + lp_step + lp_emission
  score_u ← up + lp_stay + lp_emission
  score_l ← left + lp_skip
  s ← max(score_d, score_u, score_l)
  d ← direction from which the max score came
end function

Note: Changes to Algorithm 9 are highlighted in blue.

Algorithm 15 Adaptive Banded Event Alignment - core-kernel - log probability computation

function log_probability_match(kcache, events)
  event ← get_event(events)   ▷ see red arrow in Fig. 5.4b
  x ← event.mean
  model_kmer ← get_entry_from_kcache(kcache)
  µ ← model_kmer.mean
  σ ← model_kmer.stdv
  z ← (x − µ) / σ
  lp_emission ← −ln(√(2π)) − ln(σ) − 0.5·z²
end function

Note: Changes to Algorithm 10 are highlighted in blue.
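The cell-score computation of Algorithm 14 reduces to taking the maximum of three candidate scores, with the emission term of Algorithm 15 being the log of a Gaussian density. A minimal Python sketch; the standalone signatures are simplifications (the transition log-probabilities are passed as plain floats rather than precomputed constants):

```python
import math

def log_probability_match(x: float, mu: float, sigma: float) -> float:
    """Emission term lp_emission: log-density of N(mu, sigma) at the
    event mean x (as in Algorithm 15)."""
    z = (x - mu) / sigma
    return -math.log(math.sqrt(2 * math.pi)) - math.log(sigma) - 0.5 * z * z

def compute_cell(up, diag, left, lp_emission, lp_step, lp_stay, lp_skip):
    """Cell score and trace-back direction (as in Algorithm 14): the best
    of a diagonal step, a stay (up) and a skip (left)."""
    candidates = {
        "diag": diag + lp_step + lp_emission,
        "up": up + lp_stay + lp_emission,
        "left": left + lp_skip,
    }
    d = max(candidates, key=candidates.get)
    return candidates[d], d
```

On the GPU, W such cell computations run in parallel (one per thread) for each band, with the three neighbour scores read from the shared-memory band buffers.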
Figure 5.7: Utility of kcache in the core-kernel to improve memory coalescing — (a) random accesses to the model array (red arrows) when kcache is not employed; (b) construction of the kcache in the pre-kernel; (c) spatially local memory accesses (green arrows) when kcache is employed.

The CPU version of the Adaptive Banded Event Alignment (ABEA) algorithm performs dynamic memory allocations (malloc) on a per-read basis. The number of reads in a dataset is in the order of millions and thus incurs millions of malloc calls. Moreover, dynamic memory allocations (malloc performed inside GPU kernels) are extraordinarily expensive in terms of execution time [192]. In fact, our initial GPU kernel implementation, which performed such memory allocations, was more than 100× slower than the CPU implementation. The intuitive approach of statically allocating memory at compile time is not practical, as nanopore read lengths vary significantly (~100 bases to >1 Mbases, as explained previously) and thus the associated data structures vary from ~200 KB to >1.5 GB. We present a methodology that significantly reduces the number of memory allocations by pre-allocating large chunks of contiguous memory at the beginning of the program to accommodate a batch of reads; these chunks are then reused throughout the lifetime of the program. The sizes of these large chunks are determined by the available GPU memory and the average number of events per base (i.e. the average value of the number of events divided by the read length).
For a given batch of reads, we assign reads to the GPU until the allocated GPU memory chunks saturate; the rest of the reads are assigned to the CPU.

We describe the memory allocation technique in two steps: Section 5.4.2.1 explains how the memory allocation for one batch of reads at a time is performed; and Section 5.4.2.2 explains how the method in Section 5.4.2.1 can be expanded to reuse large chunks of memory allocated at the beginning of the program.

In the three GPU kernels elaborated in Section 5.4.1, the data arrays associated with each read are ref, kcache, events, score, trace, ll_idx and alignment (the final output from the post-kernel). If any of these arrays were allocated inside the GPU kernels on a per-read basis (for instance, if the score and trace arrays were allocated at line 4 of Algorithm 12 using malloc), the performance would be degraded.

We identified that the sizes of all the aforementioned data arrays depend only on the read length (known at run-time during file reading) and the number of events for the read (known after event detection, described in Section 5.2). Thus, the sum of the read lengths and the sum of the numbers of events for a batch of n reads (the GPU processes a batch of n reads at a time) are used to calculate the sizes of the memory allocations required for the particular batch, according to the formulation below.

Let n be the number of reads loaded to the RAM (from the disk) at a time. Let r[] be the read lengths and e[] be the numbers of events for all the reads in the batch of n reads. Column 1 of Table 5.1 lists the data arrays. The sizes of the arrays ref and kcache depend only on the read lengths r; events and alignment depend on the numbers of events e; and score, trace and ll_idx depend on both the read lengths r and the numbers of events e. Based on these dependencies, the arrays are categorised in Table 5.1 by horizontal separators. The second column of Table 5.1 states the data-type size of each array, denoted by constants of the form C_x.
Typical values of these constants (in our implementation) are given inside brackets. For instance, the data type of ref is char and thus C_r is 1 byte, while the data type of events is a struct of size C_e = 20 bytes. Note that the exact values may depend on the implementation and the underlying processor architecture, but they are nevertheless constants known at compile time. The third column of Table 5.1 shows the size required by the particular array for a single read, i.e. the size for the i-th read (assuming a 0-based index origin) in the batch of n reads. For instance, ref depends on the read length of the particular read and the data type, thus its size is C_r·r[i]; score depends on the read length, the number of events, the data-type size and the band-width W, thus W·C_s·(r[i] + e[i]). The last column of Table 5.1 is the total size required for a batch of reads (based on the sums of r and e). For instance, the total size of all the ref arrays for the batch is the product of the data-type size C_r and the sum of all read lengths in the batch, Σ_{i=0}^{n−1} r[i].

Table 5.1: Data arrays associated with ABEA and their sizes

  Array        Data type size (bytes)   Size for read i in batch   Size per batch
  ref[]        C_r (1)                  C_r·r[i]                   C_r·Σ_{i=0}^{n−1} r[i]
  kcache[]     C_k (12)                 C_k·r[i]                   C_k·Σ_{i=0}^{n−1} r[i]
  events[]     C_e (20)                 C_e·e[i]                   C_e·Σ_{i=0}^{n−1} e[i]
  alignment[]  C_a (8)                  2·C_a·e[i]                 2·C_a·Σ_{i=0}^{n−1} e[i]
  score[][]    C_s (4)                  W·C_s·(r[i] + e[i])        W·C_s·Σ_{i=0}^{n−1} (r[i] + e[i])
  trace[][]    C_t (1)                  W·C_t·(r[i] + e[i])        W·C_t·Σ_{i=0}^{n−1} (r[i] + e[i])
  ll_idx[]     C_l (8)                  C_l·(r[i] + e[i])          C_l·Σ_{i=0}^{n−1} (r[i] + e[i])

Based on the total array sizes in the last column of Table 5.1, we can allocate seven big chunks of linear contiguous memory in the GPU. Let the base addresses of those chunks be represented by uppercase letters: REF, KCACHE, EVENTS, etc.
These memory allocations are performed using cudaMalloc() API calls just before the kernel invocations, and the memory is deallocated after the kernels. Note that, for now, these allocations and deallocations are performed for each batch of reads.

The GPU arrays REF, KCACHE, EVENTS, etc., allocated using cudaMalloc above, are 1D arrays; thus, multi-dimensional arrays in the RAM (e.g. an array of pointers, each pointing to a string/char array) must be serialised/flattened. One option is to save a series of pointers associated with each array during the serialisation and then utilise those pointers for addressing a particular element later. However, this can be done better by storing only two offset arrays of length n each: the read-offset array p[], which is the cumulative sum of the read lengths in the batch (p[i] = Σ_{j=0}^{i−1} r[j]); and the event-offset array q[], which is the cumulative sum of the numbers of events in the batch (q[i] = Σ_{j=0}^{i−1} e[j]). Note that r and e have the same definitions as before. These two offset arrays p and q can be used to deduce the pointer to a given element when required, by computing the array offset as shown in Table 5.2a. The first column of Table 5.2a is the base address of each of the large GPU arrays allocated above. The offset of the element pertaining to the i-th read (assuming 0-indexing) in the particular array is given in the second column of Table 5.2a. The definitions of the constants C_x and W are the same as in Table 5.1. These 1D array base addresses in
Table 5.2: GPU data arrays, pointer computation and heuristically determined sizes

(a) Computation of the pointer for read i

  1D GPU array (base address)   Offset to element i in the batch
  REF                           C_r·p[i]
  KCACHE                        C_k·p[i]
  EVENTS                        C_e·q[i]
  ALIGNMENT                     C_a·q[i]
  SCORE                         W·C_s·(p[i] + q[i])
  TRACE                         W·C_t·(p[i] + q[i])
  LL_IDX                        C_l·(p[i] + q[i])

(b) Heuristic allocation

  1D GPU array (base address)   Allocated size per batch
  REF                           C_r·X
  KCACHE                        C_k·X
  EVENTS                        C_e·Y
  ALIGNMENT                     C_a·Y
  SCORE                         W·C_s·(X + Y)
  TRACE                         W·C_t·(X + Y)
  LL_IDX                        C_l·(X + Y)

the first column of Table 5.2a, together with the two associated offset arrays p[] and q[], are passed as arguments to the GPU kernels (Algorithm 12 and Algorithm 13). These arguments are used for the memory-pointer computation inside the GPU kernels (line 4 of Algorithm 12 and line 4 of Algorithm 13), based on the second column of Table 5.2a.

Algorithm 16 elaborates how the above strategy is integrated into the execution flow previously depicted in Algorithm 11. Lines 3-7 of Algorithm 16 show how the offset arrays p and q are computed for each batch of reads. Line 8 performs the serialisation of the multi-dimensional arrays with the use of the offset arrays p and q. Line 9 allocates the GPU arrays based on the sizes in the last column of Table 5.1. Then, the serialised arrays are copied to the allocated GPU memory (line 10), the GPU kernels (the three kernels discussed in Section 5.4.1) are executed (line 11), and the alignment result is copied back from the GPU (line 12). At the end, the alignment result is converted back to multi-dimensional arrays (line 13), and the GPU memory (allocated at line 9) is deallocated (line 14).

The offset arrays p and q (and also REF, KCACHE, EVENTS, etc.) are passed to the GPU kernels and are utilised inside the kernels to compute the memory pointers (line 4 of Algorithms 12 and 13) through the equations listed in the second column of Table 5.2a.
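The offset arrays p and q and the per-read offsets of Table 5.2a can be sketched as follows; C_s and W use the typical values from Table 5.1, and the function names are illustrative:

```python
C_s, W = 4, 100  # typical score element size (bytes) and band-width from Table 5.1

def compute_offsets(r, e):
    """Exclusive prefix sums over the batch:
    p[i] = sum of r[0..i-1], q[i] = sum of e[0..i-1]."""
    p, q, rs, es = [], [], 0, 0
    for ri, ei in zip(r, e):
        p.append(rs)
        q.append(es)
        rs += ri
        es += ei
    return p, q

def score_offset(p, q, i):
    """Offset of read i's region inside the 1D SCORE array:
    W * C_s * (p[i] + q[i]), per Table 5.2a."""
    return W * C_s * (p[i] + q[i])
```

Only the two length-n arrays p and q need to be passed to the kernels; every per-read pointer is then recomputed on the fly from the base addresses, as in line 4 of Algorithms 12 and 13.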
Algorithm 16 Memory allocation—data structure serialisation

 1: for each batch of n reads do
 2:   ...                              ▷ CPU processing steps before ABEA, e.g. event detection
 3:   rs, es ← 0, 0                    ▷ cumulative sums of read lengths and numbers of events
 4:   for each read i do
 5:     p[i], q[i] ← rs, es            ▷ save the current read and event offsets
 6:     rs ← rs + r[i]; es ← es + e[i]
 7:   end for
 8:   serialise_ram_arrays(p, q, ...)  ▷ flatten the multi-dimensional arrays in RAM to 1D arrays
 9:   allocate_gpu_arrays(rs, es, ...) ▷ GPU arrays REF, KCACHE, EVENTS, etc.
10:   memcpy_ram_to_gpu(...)           ▷ copy the inputs of ABEA to the GPU memory
11:   gpu_alignment(p, q, ...)         ▷ perform ABEA on the GPU
12:   memcpy_gpu_to_ram(...)           ▷ copy the alignment result back to the RAM
13:   deserialise(p, q, ...)           ▷ convert the 1D result array to multi-dimensional arrays
14:   free_gpu_arrays()                ▷ free the GPU arrays REF, KCACHE, EVENTS, etc.
15:   ...                              ▷ CPU processing steps after ABEA, e.g. HMM
16: end for

The limitation of this strategy is the GPU memory allocation and deallocation (lines 9 and 14 of Algorithm 16) performed for each batch of reads, which is expensive on certain GPUs (see Section 5.5.2.2). This limitation is remedied by the heuristic-based pre-allocation strategy explained in the next subsection.

The GPU memory allocations in the previous section, which were performed for each batch, could be eliminated by pre-allocating all the available GPU memory at the start of the program and then reusing it for subsequent batches of reads. If the sizes of the arrays depended only on the read length, the total read length that fits into the available GPU memory could be derived directly; the available memory could then be allocated among the seven large arrays (REF, KCACHE, EVENTS, etc.) in the correct proportions. However, these array sizes depend on both the read length and the number of events, which are unknown at the beginning of the program; thus, the memory cannot be directly partitioned among the data arrays.
Therefore, we present a heuristic approach which exploits the characteristics of nanopore data to estimate the proportions, so as to maximally utilise the available GPU memory. In summary, we obtain the average number of events per base (the average of the number of events divided by the read length), use this average to determine the maximum read length that can be accommodated on the GPU, and proportionally allocate the GPU arrays. This approach is formulated as follows.

The sum of all the cells in the last column of Table 5.1 is the total memory required for a batch of n reads. This sum simplifies to Equation 5.1 (by collecting the constants), where C_R = C_r + C_k + W·C_s + W·C_t + C_l and C_E = C_e + 2·C_a + W·C_s + W·C_t + C_l. This sum represents the total size of all the arrays (for the adaptive banded event alignment algorithm) for a batch of n reads.

    S = C_R·Σ_{i=0}^{n−1} r[i] + C_E·Σ_{i=0}^{n−1} e[i]        (5.1)

If µ̄ is the average number of events per base (the total number of events divided by the total read length for all reads in the batch), we can write Σ_{i=0}^{n−1} e[i] = µ̄·Σ_{i=0}^{n−1} r[i]. Substituting this into Equation 5.1 gives S = (C_R + µ̄·C_E)·Σ_{i=0}^{n−1} r[i]. We observed that, for a sufficiently large batch size (>64), µ̄ is stable at ~2.5 (on more than 10 datasets we tested). Let this estimated value of µ̄ be represented by the constant µ. Thus, the total memory required for a batch of reads can be estimated using Equation 5.2.

    M = (C_R + µ·C_E)·Σ_{i=0}^{n−1} r[i]                        (5.2)

Equation 5.2 can be used to estimate the maximum number of bases (the sum of read lengths) that a given amount of GPU memory can accommodate. Let M in Equation 5.2 be the available GPU memory. Then, the approximate maximum number of bases X that fits the available GPU memory M can be computed via Equation 5.3, and the associated total number of events Y which the GPU memory can accommodate is found by Equation 5.4.
    X = floor( M / (C_R + µ·C_E) )                              (5.3)

    Y = floor( µ·X )                                            (5.4)

These X and Y allow the available GPU memory to be allocated among the seven large arrays (REF, KCACHE, EVENTS, etc.) in approximately correct proportions, as shown in the second column of Table 5.2b. The values in the second column of Table 5.2b are obtained by substituting Σ_{i=0}^{n−1} r[i] with X and Σ_{i=0}^{n−1} e[i] with Y in the last column of Table 5.1.

By incorporating the above heuristic-based memory allocation strategy into Algorithm 16, we get the execution flow in Algorithm 17. The major changes to the previous Algorithm 16 are highlighted in blue text. Now the GPU memory is allocated at the beginning of the program, based on the estimated X and Y, on line 1 of Algorithm 17. As X and Y are approximations, the GPU arrays may saturate for certain batches of reads. Line 6 of Algorithm 17 checks if the GPU arrays are saturated and assigns each read to either the GPU (line 9) or the CPU (line 11) accordingly. Only a few reads are assigned to the CPU, and these few reads are processed on the CPU in parallel with the GPU kernel execution; thus, no additional execution time is incurred.

With the heuristic-based memory pre-allocation strategy described in this section, cudaMalloc operations are invoked only at the beginning of the program, and thus there is no additional memory allocation overhead during the processing. Note that our implementation is future-proof: µ is a user-specified parameter (initialised to 2.5 by default), in case nanopore data characteristics change in the future.

If all the reads were of similar length, the GPU threads that process the reads would complete at approximately the same time, and thus the GPU cores would be busy throughout the execution. However, as stated in Section 5.2, there can be a few reads which are significantly longer than the others (we will refer to them as very long reads).
When the GPU threads process reads in parallel, the presence of such very long reads causes all other GPU threads to wait until the threads processing the longest read complete. This waiting leads to under-utilisation of the GPU cores. Thus, we process these very long reads on the CPU while the GPU is processing the rest in parallel. However, there can be exceptionally long reads (we will refer to them as ultra long reads) for which the CPU takes longer than the GPU takes to process the whole batch; such reads would leave the GPU idle until the CPU completes. Thus, ultra long reads are skipped and processed separately at the end by the CPU. Similarly, there can be a few over-segmented reads (i.e. reads with a significantly higher events-per-base ratio than the others) which cause GPU under-utilisation. These over-segmented reads are also processed on the CPU.

We discuss the problems of very long reads and ultra long reads in detail, with examples, in Section 5.4.3.1, along with the solutions. Then, in Section 5.4.3.2, we discuss the problem of over-segmented reads and the respective solution. In Section 5.4.3.3, we discuss another factor that affects performance: the batch size (the number of reads loaded to the RAM at a time). Finally, in Section 5.4.3.4, we describe a method to detect and prompt the user about any drastic impacts on performance, along with suggestions to tune parameters to minimise the impact.

Consider a batch of reads where ~90% of the reads are less than 30 Kbases in length. Assume the longest read in the batch is 90 Kbases, and that the GPU is processing all the reads in the batch in parallel. Suppose that the GPU threads processing reads of length <30 Kbases (90% of the threads) complete in <300 ms, while the GPU threads processing the longest 90 Kbases read take 900 ms. As a result, the completed GPU threads have to wait for an additional 600 ms.
In general, the few very long reads consume a significantly longer time to process on the GPU than the other reads in the batch; the majority of the GPU threads have to wait, causing under-utilisation of the GPU compute cores. Furthermore, very long reads negatively affect GPU occupancy by occupying a significant portion of GPU memory. For instance, a read of ~10 Kbases requires only ~18 MB of GPU memory, while a read of 90 Kbases requires ~160 MB. Hence, very long reads occupy a significant portion of GPU memory and limit the number of reads that can be processed in parallel, reducing the amount of parallelism and thus the occupancy of the GPU.

Fortunately, very long reads being few (see the typical read-length distribution under results), the CPU (whose core frequency is higher than the GPU's) can process those reads while the GPU is processing the rest. In the above example, selecting a static threshold (e.g. processing reads of length <30 Kbases on the GPU and the rest on the CPU) would give reasonable performance. However, such a static threshold is not ideal, due to variations in the read-length distribution between datasets (see background). Thus, we use the product of max-lf and the average read length in the batch to determine the threshold dynamically, where max-lf is a user parameter that defaults to 5.0. This threshold was empirically determined.

Now assume that, amongst the very long reads processed on the CPU, there are a few ultra long reads (e.g. reads >100 Kbases in a dataset where >99% of the reads are <100 Kbases). Such ultra long reads can cause a severe load imbalance between the CPU and the GPU. For instance, assume that a given read batch contains a read of 1 Mbases. Despite its higher core frequency, the CPU will take a few seconds to process such an ultra long read.
The GPU meanwhile would process the whole batch in less than 1 s (see results for empirical evidence). Such ultra long reads, being <1% of the reads, are skipped during processing (while being written to a separate file) and are processed separately by the CPU at the end. In our implementation, the threshold for ultra long reads is a user-defined parameter which defaults to 100 Kbases. There is an additional advantage of processing ultra long reads later: they usually require a significant amount of RAM (a few gigabytes) and may crash the program on limited-memory systems. At the end, these reads can be processed with a limited number of threads to reduce the peak memory consumption, particularly if the size of the RAM is limited.

Once the very long reads and ultra long reads are handled as in Section 5.4.3.1, the performance impact of over-segmented events becomes prominent. While the majority of the reads have a number of events per base close to the average µ (= 2.5), a few reads with a significantly higher events-per-base ratio can violate the suitability of our partitioning of GPU memory as X and Y (X and Y are derived in equations 5.3 and 5.4). These over-segmented reads cause the GPU arrays proportional to Y to become full while the arrays proportional to X are left under-utilised. For instance, arrays proportional to Y can become 100% full while arrays proportional to X are only filled to <70%. Hence, over-segmented reads lead to under-utilisation of GPU memory and limit the number of reads that are processed in parallel. We process the over-segmented reads on the CPU based on a user-specifiable threshold max-epk, which defaults to 5.0.

On rare occasions, reads with >100 events per base were observed. Such severely over-segmented reads can be processed separately at the end or ignored entirely, as such rare reads amongst millions of others are unlikely to affect the final polishing result.
Selection of a proper batch size (the number of reads loaded to RAM from the disk at a time) is another important parameter that affects performance. If the batch size is too small compared to what the GPU memory can accommodate, the number of reads processed in parallel is limited, leading to inadequate occupancy. Conversely, if the batch is too large to fit the GPU, the CPU has to process many surplus reads that could not be accommodated on the GPU. The batch size in our implementation is determined by two user-specified parameters: K, the maximum number of reads; and B, the maximum number of total bases. When reading from the disk to RAM, the true batch size (n, the number of reads, and b, the number of total bases, capped by K and B respectively) is determined by whichever value (n or b) reaches its cap (K or B) first. Having such a limit B allows capping the peak RAM in the presence of adjacent very long reads. The suitable value for B depends on the available GPU memory, which can be estimated via equation 5.3 discussed in Section 5.4.2.

While we have empirically determined typical parameters/thresholds (associated with the above strategies), an unusual situation (for instance, a big gap between the CPU and GPU specifications, or a dataset that severely deviates from the heuristics we use) may cause performance anomalies. We employ the following method to detect a severe performance anomaly caused by such an unusual scenario.

We measure quantities representing resource utilisation during run time, listed in Table 5.3. These quantities are measured per batch of reads loaded to the RAM. We use the measured quantities to determine any severe performance issues and suggest suitable parameter adjustments to the user. The adjustable parameters (or thresholds) that can be tweaked to improve resource utilisation are defined in Table 5.4. Determination of
performance issues and suggestions is done via two decision trees, one corresponding to GPU memory usage (Fig. 5.8a) and another corresponding to balancing the load between the CPU and GPU (Fig. 5.8b).

Table 5.3: Measured quantities

quantity   description
t_CPU      processing time on the CPU
t_GPU      processing time on the GPU
X_util     utilisation percentage of the arrays proportional to X (rs as a percentage of X in Algorithm 17)
Y_util     utilisation percentage of the arrays proportional to Y (es as a percentage of Y in Algorithm 17)
N_memout   number of reads assigned to the CPU due to GPU memory getting prematurely full (corresponds to line 11 of Algorithm 17)
N_long     number of very long reads assigned to the CPU (corresponds to user parameter max-lf)
N_events   number of reads with too many events per base assigned to the CPU (corresponds to user parameter max-epk)
n          the number of reads actually loaded to the RAM
b          the number of bases actually loaded to the RAM

Table 5.4: Adjustable user parameters

parameter     description
max-lf        reads with length ≤ max-lf × average_read_length are assigned to the GPU, the rest to the CPU
avg-epk       average number of events per base used for allocating GPU arrays as discussed previously (µ)
max-epk       reads with events per base ≤ max-epk are assigned to the GPU, the rest to the CPU
K             upper limit of the batch size with respect to the number of reads
B             upper limit of the batch size with respect to the number of bases
t             number of CPU threads
ultra-thresh  threshold to skip ultra long reads

Fig. 5.8a shows the decision tree that detects any imbalance in the proportions X and Y associated with GPU array allocation (X and Y are derived in equations 5.3 and 5.4). The objective of this decision tree is to detect any GPU memory wastage and to increase the number of reads which the GPU gets to process in parallel.
[Figure 5.8: Decision trees for resource optimisation. (a) GPU memory utilisation, leading to suggestions S1-S4; (b) load balancing between CPU and GPU, leading to suggestions T1-T5]

As shown in Fig. 5.8a, if both X_util and Y_util (rs as a percentage of X and es as a percentage of Y in Algorithm 17) are more than 70%, the utilisation of the GPU arrays is considered reasonable. Note that 70% is an empirically determined value that provides adequate performance. If X_util is reasonable (>70%) and Y_util is unreasonable (<70%), we inspect for any significant imbalance between X_util and Y_util (X_util - Y_util > 30%). Such a significant gap suggests under-utilisation, which should be remedied by increasing max-epk (the threshold at which over-segmented reads are offloaded to the CPU) or by reducing Y through decreasing avg-epk (node S1 in Fig. 5.8a). In contrast, if Y_util is reasonable and X_util is unreasonable, the strategy is the opposite, i.e., either decrease max-epk or increase avg-epk (node S2 in Fig. 5.8a).

If both X_util and Y_util are less than 70%, a likely cause is a batch size inadequate to fill the GPU memory. The actual batch size (n, b) is determined by both K and B as stated previously. As shown in Fig. 5.8a, we check which limit out of K and B was reached first. If both n < K and b < B, the currently processed batch being the last batch in the dataset (end of input data reached) is the likely cause; thus, no parameter tuning is necessary. If B was reached first (n < K and not b < B), B is the limitation and should be increased (S3 in Fig. 5.8a).
If K was reached first (not n < K, and b < B), K should be increased (S4 in Fig. 5.8a).

Fig. 5.8b shows the decision tree for CPU-GPU workload balancing. For a particular batch, if the CPU takes significantly more time than the GPU, the decision tree first inspects whether the CPU has been assigned an excessive workload. An excessive workload on the CPU can be attributed to: an extensively over-sized batch (in comparison to the available GPU memory), which results in a majority of the reads being assigned to the CPU (N_memout > 10%); an excessive number of very long reads assigned to the CPU (N_long > 10%); or an excessive number of over-segmented reads assigned to the CPU (N_events > 10%). If N_memout > 10%, K is reduced (node T1 in Fig. 5.8b); if N_long > 10%, max-lf is increased (T2 in Fig. 5.8b); and if N_events > 10%, max-epk is increased (T3 in Fig. 5.8b).

If the cause of the higher CPU time is not the aforementioned excessive workload, a likely cause is ultra long reads, where a single ultra long read processed on the CPU takes more time than the GPU takes for the whole batch. In such an event, the ultra-thresh threshold is reduced so that more ultra long reads are skipped. Another likely cause is that the program was executed with inadequate threads (the CPU had more hardware threads than the program was launched with), which is remedied by increasing the number of CPU threads. Another cause might be that the CPU is not sufficiently powerful to match the GPU, in which case no action can be taken (except upgrading the CPU). These actions are denoted by T4 in Fig. 5.8b.

The ideal case is when the CPU and GPU take similar times, which requires no intervention. Conversely, if the GPU takes significantly more time than the CPU, the likely causes are very long reads or over-segmented reads.
In such an event, the thresholds max-lf and max-epk are decreased so that more very long reads and over-segmented reads are assigned to the CPU. Another likely cause is ultra long reads, which can be remedied by increasing the ultra-thresh threshold. A further cause might be an insufficiently powerful GPU (fewer compute cores or less memory) compared to the CPU, in which case no action is taken (except upgrading the GPU). These actions are denoted by T5 in Fig. 5.8b.

To reduce false positives due to incidental under-utilisation, a suggestion is provided to the user only if the same condition (the condition that led to the decision in the decision tree, S1 to S4 in Fig. 5.8a and T1 to T5 in Fig. 5.8b) repeats consecutively more than a few times (e.g., >3 times).

Note that the above-mentioned strategy is intended to warn the user of, and suggest, potential parameter adjustments in the event of drastic performance degradation, rather than to obtain optimal performance or to determine exact parameter values.

5.5. RESULTS

The experimental setup is given in Section 5.5.1. In Section 5.5.2, we present experimental evidence that justifies the selection of the steps presented in Section 5.4. Next, in Section 5.5.3, we compare the GPU implementation of the Adaptive Banded Event Alignment (ABEA) algorithm to its CPU implementation. Finally, we show the overall speedup of the GPU implementation when incorporated into an actual workflow (i.e., detection of methylated bases).

We re-engineered the Nanopolish methylation calling tool (the existing methylation detection tool discussed in Section 5.2) to: one, load a batch of n reads from disk to RAM at a time, instead of on-demand loading; two, synchronise CPU threads prior to GPU kernel invocation (Nanopolish assigns a read dynamically to a particular thread, thus each read follows its own code path); and three, optimise the CPU implementation, which otherwise would result in an apparently unfair speedup (when the optimised GPU version is compared to an unoptimised CPU version).
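The repeat-before-warning rule amounts to a streak counter. The sketch below is illustrative (the `should_warn` helper is hypothetical, not f5c's API): a suggestion is surfaced only once the same decision-tree outcome has occurred on more than three consecutive batches.

```c
#include <string.h>

static char last[64] = "";
static int streak = 0;

/* Returns 1 only when the same suggestion string has been produced on
   more than three consecutive batches (the ">3 times" rule in the text). */
int should_warn(const char *suggestion) {
    if (strcmp(suggestion, last) == 0) {
        streak++;
    } else {
        strncpy(last, suggestion, sizeof last - 1);
        last[sizeof last - 1] = '\0';
        streak = 1;
    }
    return streak > 3;
}
```

Feeding `should_warn("S1")` on four consecutive batches returns 0, 0, 0 and then 1; any intervening different suggestion resets the streak.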
The re-engineered Nanopolish employs a fork-join multi-threading model (with work stealing) implemented using C POSIX threads. The ABEA algorithm for the GPU was implemented using CUDA C. This re-engineered Nanopolish will hereafter be referred to as f5c.

We used the publicly available NA12878 (human genome) Nanopore WGS Consortium sequencing data [56] for the experiments. The datasets used, their statistics (number of reads, total bases, mean read length and maximum read length) and their sources are listed in Table 5.5.

Table 5.5: Information of the datasets

dataset     number of reads  number of bases (Gbases)  mean read length (Kbases)  max read length (Kbases)  source / SRA accession
D_small     -                -                         -                          -                         -
D_ligation  -                -                         -                          -                         -
D_rapid     -                -                         -                          -                         -

Table 5.6: Different systems used for experiments

name  info                               CPU                                CPU cores/threads  RAM (GB)  GPU             GPU mem (GB)     GPU arch / compute capability
SoC   NVIDIA Jetson TX2 embedded module  ARMv8 Cortex-A57 + NVIDIA Denver2  6/6                8         Tegra           shared with RAM  Pascal / 6.2
lapL  Acer F5-573G laptop                i7-7500U                           2/4                8         GeForce 940M    4                Maxwell / 5.0
lapH  Dell XPS 15 laptop                 i7-8750H                           6/12               16        GeForce 1050Ti  4                Pascal / 6.1
ws    HP Z640 workstation                Xeon E5-1630                       4/8                32        Tesla K40       12               Kepler / 3.5
HPC   Dell PowerEdge C4140               Xeon Silver 4114                   20/40              376       Tesla V100      16               Volta / 7.0

D_small, a small subset, is used for running on a wide range of systems (all systems in Table 5.6: embedded system, low-end and high-end laptops, workstation and high-performance server). The two complete MinION datasets (D_ligation and D_rapid) were only tested on three systems, due to the large run-time and incidental access to the other two systems. D_ligation and D_rapid represent the two existing nanopore sample preparation methods (ligation and rapid [199]) that affect the read length distribution.

The D_small dataset was used for the experiments under Sections 5.5.2.2, 5.5.2.1 and 5.5.3.1.
For the experiments under Sections 5.5.3.2 and 5.5.4, the datasets D_rapid and D_ligation were used.

To obtain the results for Section 5.5.2.3, we first grouped the reads in dataset D_rapid based on their read lengths, into 10 Kbase bins (i.e., 0K-10K, 10K-20K, ..., 90K-100K). Reads with >100 Kbases were grouped into larger bins (100K bin sizes: 100K-200K, 200K-300K and 300K-400K), as the read count in this range is so small that certain 10K bins would contain no reads at all. Then, we ran f5c with only the CPU, and f5c with GPU acceleration, on each group of reads separately. We then computed the speedup of ABEA for each group of reads: the kernel-only speedup (GPU kernel time vs. time on the CPU); and the speedup with overheads (overheads such as memory copy and data structure serialisation). This experiment was performed on the system lapH.

For Sections 5.5.2 and 5.5.3, time measurements were obtained by inserting gettimeofday function invocations directly into the C source code. The total execution time and the peak RAM usage in Section 5.5.4 were measured by running the GNU time utility with the verbose option.

Fig. 5.9a shows the time consumed by the three GPU kernels after applying the compute optimisation techniques discussed in Section 5.4.1. The time taken by each of the three GPU kernels (pre-kernel, core-kernel and post-kernel) is plotted for each different GPU. It is observed that the core-kernel, which computes the dynamic programming table (the compute-intensive portion), still consumes the majority of the GPU compute time. The pre-kernel, which performs data structure initialisation, consumes much less time, showing that there is no need to further parallelise the loop in Algorithm 12 (explained in Section 5.4.1). Despite the lack of fine-grained parallelism in the post-kernel (which performs backtracking), its elapsed time is still considerably less than that of the core-kernel.
Thus, any future optimisations should still mainly focus on the core-kernel, followed by the post-kernel.

The efficacy of our compute optimisations on the compute-intensive core-kernel can be elaborated using the statistics reported by the NVIDIA profiler (instruction-level profiling, i.e., PC sampling in the NVIDIA visual profiler [201]). The profiler reports the percentage distribution of the reasons that caused thread warps to stall, based on the number of clock cycles. The percentage of clock cycles during which a warp was stalled due to a memory dependency (waiting for a previous memory access to complete) improved from 59.10% to 44.81% after the use of GPU shared memory. After exploiting the kcache for improving memory coalescing, this percentage further improved to 28.62%.

As stated in Section 5.4.2.1, the data array serialisation technique eliminated all memory allocations inside GPU kernels (malloc); however, it still required memory allocations for each batch of reads (cudaMalloc). The overhead due to these cudaMalloc calls is plotted in Fig. 5.9b along with the time for kernel execution and data transfer to/from the GPU (using cudaMemcpy). Observe that on certain GPUs (Jetson TX2, GeForce 940M and Tesla K40), the overhead due to cudaMalloc operations is significant in comparison to the compute kernels (even higher than the compute kernels on the Jetson TX2). Such significant overheads justify our heuristic-based memory pre-allocation technique (Section 5.4.2.2), which completely eliminates this overhead.

Interestingly, the Tesla K40 and GeForce 940M, which incurred high cudaMalloc overheads, are of relatively older GPU architectures in comparison to the GeForce 1050 and Tesla V100, where the overheads were minimal. This is probably due to hardware-supported memory allocation in the latest GPU architectures.
However, the aforementioned observation seems to hold only for GeForce GPUs (targeted at gaming on PCs/laptops) and Tesla GPUs (targeted at high-performance computing). On Tegra GPUs (SoCs targeted at embedded devices) the overhead seems to be significant in spite of the latest architectures (the Jetson TX2 has the same Pascal architecture as the GeForce 1050). We additionally tested on a Jetson AGX Xavier (the most recent Tegra GPU based SoC, with the Volta architecture) and cudaMalloc was still expensive (40 s on GPU kernels and 44 s on cudaMalloc; not shown in the figure). Thus, our memory pre-allocation strategy (Section 5.4.2.2), which totally eliminates this cudaMalloc overhead, is specifically beneficial for GPUs on SoCs.

We stated in Section 5.4.3 that very long reads, if processed on the GPU, limit the GPU occupancy. Fig. 5.9c provides experimental evidence and shows the need to process very long reads on the CPU (explained in Section 5.4.3). Fig. 5.9c plots the variation of the speedup (GPU compared to CPU for ABEA) as the read length varies. The x-axis labels the range of read lengths for which the speedup was computed (explained in the experimental setup). For instance, 0-10 on the x-axis refers to the group of reads with read lengths of 0-10 Kbases. Note that in Fig. 5.9c the bins are 100K wide from 100K-200K onwards, due to the smaller number of reads of those lengths (explained in the experimental setup). The speedup of computations (GPU kernel time vs. CPU time) and the speedup including overheads (GPU kernel time plus overheads such as memory copy and data structure serialisation) are plotted in Fig. 5.9c. A speedup of more than 4X was observed for the smaller read lengths (0-10K). The speedup drops with increasing read length and is less than 3X from 50K-60K onwards.
The longer the reads are, the fewer reads can be processed on the GPU in parallel (reduced occupancy), and thus the lower the speedup. Hence, very long reads, which significantly affect performance, should be processed on the CPU while the GPU is processing the rest.

Fig. 5.9d shows the need for processing ultra long reads separately (explained in Section 5.4.3). The x-axis in the figure is the read length (similar to Fig. 5.9c). The blue bars (with reference to the right y-axis) denote the average time consumed by the GPU to process a batch of reads (1.5 Mbases), for each group of read lengths from 0 bases to 50 Kbases. The orange bars (with reference to the right y-axis) denote the average time consumed by the CPU (1 thread) to process a single read in the particular group. The read length distribution (left y-axis) is shown shaded in green to depict the abundance of reads at each read length. Observe that the CPU takes >1.6 s for a single read of 300K-400K length, while the GPU completes a whole 40K-50K batch in <0.4 s. Thus, the GPU would idle for >1.2 s until the CPU completes processing. Hence, such ultra long reads (e.g., >100 Kbases) must be skipped and processed separately at the end. Note that such ultra long reads are very few (green read length distribution in Fig. 5.9d).

[Figure 5.9: Effect of individual optimisations. (a) Distribution of GPU kernel execution time (pre-kernel, core-kernel, post-kernel); (b) time for GPU kernels compared to cudaMalloc and cudaMemcpy; (c) effect of the read length on the speedup (kernel only, and kernel plus overheads); (d) need for load-balancing based on read lengths]
In this subsection, we present the performance of the GPU ABEA implementation when all the optimisations in Section 5.4 are applied together. Note that we compare this optimised GPU version with the optimised CPU version in f5c (not the CPU version in the original Nanopolish). The CPU version was run with the maximum number of threads supported on the system. The optimised CPU version will hereafter be referred to as CPU-opti and the optimised GPU version as GPU-opti. First, we compare the run-times of CPU-opti and GPU-opti on a wide range of computer systems in Section 5.5.3.1, and then on the two big datasets in Section 5.5.3.2.

Fig. 5.10a shows the time for CPU-opti (left bars) and GPU-opti (right bars) for the dataset D_small, for each system listed in Table 5.6. The run-time for the GPU has been broken down into: compute kernel time; different overheads (memory copying to/from the GPU, data serialisation time); and the extra CPU time due to reads processed on the CPU. The compute kernel time is the sum of the times for all three kernels (pre-kernel, core-kernel and post-kernel). The extra CPU time is the additional time spent by the CPU to process the reads assigned to the CPU (excluding the processing time that overlaps with the GPU execution, i.e., only the extra time for which the GPU has to wait after its execution is included). Note that ultra long reads were not separately processed on the CPU, as D_small contains a minuscule number of ultra long reads.

The speedups (including all overheads) observed for GPU-opti compared to CPU-opti are: ~4.5× on the low-end laptop and the workstation; ~4× on the Jetson TX2 SoC; and ~3× on the high-end laptop and the HPC.
Note that the lower ~3× speedup on the high-end laptop and the HPC (in comparison to >=4× on the other systems) is due to the CPUs on those particular systems having a comparatively higher number of CPU threads (12 and 40, respectively).

[Figure 5.10: Speedup of ABEA on GPU compared to CPU. (a) Performance comparison of ABEA on CPU vs GPU for D_small over a wide range of systems; (b) performance comparison of ABEA on CPU vs GPU across the full datasets D_rapid and D_ligation]

The time taken by CPU-opti compared to GPU-opti for the two big datasets (D_ligation and D_rapid) is shown in Fig. 5.10b. Experiments were performed only on three systems due to the limited availability of the other devices (mentioned previously). The graph is similar to Fig. 5.10a, except that the extra CPU time has been further broken down into: CPU very long reads; and CPU ultra long reads. CPU very long reads refers to the additional time spent by the CPU to process very long reads, and CPU ultra long reads refers to the processing time for ultra long reads (reads >100 Kbases) performed separately on the CPU. A speedup of ~3× was observed on all three systems. Due to more ultra long reads in D_ligation and D_rapid than in D_small, the overall speedup for the SoC is limited to around ~3×, compared to ~4× for D_small.

f5c compared with the original Nanopolish

In this section, we demonstrate the overall performance when the GPU-accelerated ABEA is incorporated into an actual methylation detection workflow. As stated in the experimental setup, we re-engineered Nanopolish to overcome the limitations of the original Nanopolish.
We compare the total run-time for methylation calling using the original Nanopolish against f5c (both CPU-only and GPU-accelerated versions). We refer to the original Nanopolish (version 0.9) as nanopolish-unopti, f5c run only on the CPU as f5c-cpu-opti, and GPU-accelerated f5c as f5c-gpu-opti. We executed nanopolish-unopti, f5c-cpu-opti and f5c-gpu-opti on the full datasets D_rapid and D_ligation. Note that all executions were performed with the maximum number of CPU threads supported on each system.

The run-time results are shown in Fig. 5.11. The reported run-times are for the whole methylation calling process (all steps mentioned in Section 5.2.2) and include disk I/O time. As each read executes on its own code path in the original Nanopolish (as mentioned in the experimental setup), the time for individual components (e.g., ABEA) cannot be accurately measured; thus, we only compare total run-times.

f5c-cpu-opti for the D_rapid dataset was ~2× faster on the SoC and lapH, and ~4× faster on the HPC. nanopolish-unopti crashed on the SoC (8 GB RAM) and lapH (16 GB RAM) when run on the D_ligation dataset, due to the Linux Out Of Memory (OOM) killer [202]. When run on D_ligation on the HPC, f5c-cpu-opti was not only 6× faster than the original Nanopolish, but also consumed only ~15 GB of RAM, as opposed to >100 GB by the original Nanopolish (both run with 40 threads). Hence, it is evident that CPU optimisations alone provide significant improvements.

As per Fig. 5.11, for the whole methylation-calling process (including disk I/O), f5c-gpu-opti (only ABEA is performed on the GPU) compared to f5c-cpu-opti was 1.7× faster on the SoC, 1.5-1.6× faster on lapH and <1.4× faster on the HPC. On the HPC, the speedup was limited to <1.4× due to file I/O being the bottleneck.

When the execution time of f5c-gpu-opti for D_rapid is compared with the original Nanopolish, it is ~4×, ~3× and ~6× faster on the SoC, laptop and HPC, respectively. On the HPC for D_ligation, f5c-gpu-opti was ~9× faster.
[Figure 5.11: Comparison of f5c to Nanopolish; run-times of nanopolish-unopti, f5c-cpu-opti and f5c-gpu-opti on SoC, lapH and HPC for D_rapid and D_ligation (Nanopolish crashed on the SoC and lapH for D_ligation)]

Note that we used Nanopolish v0.9 for the comparison, as the re-engineering was done on this particular version. As we incorporated a number of CPU optimisations identified during the re-engineering into the subsequent version of Nanopolish (only those that did not require major code refactoring), the latest Nanopolish v0.11 should be faster than the v0.9 used in this paper.

5.6. DISCUSSION

With the method discussed in this paper, the complete methylation calling of a human genome can now be performed on-the-fly (processed in real time while the nanopore sequencer is operating) on an embedded system (e.g., an SoC equipped with an ARM processor and an NVIDIA GPU), as shown in Fig. 5.12. Four Oxford Nanopore MinION devices sequencing in parallel, or a single Oxford Nanopore GridION, are capable of sequencing a human genome at an adequate coverage. f5c, powered by GPU-accelerated ABEA, can process the output from the rest of the pipeline on a single NVIDIA TX2 SoC at a speed of >600 Kbases per second, keeping up with the sequencing output (~600 Kbases per second [196]) as shown in Fig. 5.12. Conversely, if the original Nanopolish were executed on the NVIDIA TX2 SoC, the processing speed would be limited to ~256 Kbases per second. Our work will not only reduce the associated costs of nanopore data processing and data transfer, but will also improve the turnaround time of the final test outcome.

[Figure 5.12: Human genome processing on-the-fly; sequencing and base-calling (~630 Kbases/s of raw signal converted to ACGT bases) feed methylation calling (signal alignment and other steps) running on a 6-core ARM SoC with GPU]

In addition to embedded systems, our work benefits systems with or without a GPU.
Due to the reduced peak memory usage, methylation calling can be performed on laptops with <16 GB of RAM. Furthermore, post-sequencing methylation calling on high-performance computers also benefits from a significant speedup in processing.

A limitation of our implementation is that parameter tuning cannot be performed automatically; instead, the user is prompted when an un-optimal parameter is detected. This limitation is expected to be addressed in a future version by automatically tuning the parameters at run-time, or by the use of pre-set parameter profiles for different types of datasets and/or computer systems.

The documentation of f5c is in appendix B. Supplementary material on the design, development and deployment of f5c is available in appendix C and appendix D.

5.7. SUMMARY

The Adaptive Banded Event Alignment algorithm is one of the key components in nanopore data analysis. Despite the algorithm not being embarrassingly parallel, we presented an approach that makes it execute efficiently on GPUs. The high variability of the read lengths was one of the main challenges, which was remedied through a number of memory optimisations and a heterogeneous processing strategy that uses both the CPU and the GPU. Our optimisations yield around a 3-5× performance improvement on a CPU-GPU system when compared to a CPU. We incorporated the optimised Adaptive Banded Event Alignment algorithm into a methylation detection workflow and demonstrated that an embedded SoC equipped with an ARM processor (with six cores) and an NVIDIA GPU (256 cores) is adequate to process data from a portable nanopore sequencer in real time.

This work benefits not only embedded SoCs, but also a wide range of GPU-equipped systems from laptops to servers.
The re-engineered version of the Nanopolish methylation detection module, f5c, which employs the GPU-accelerated Adaptive Banded Event Alignment, was not only around 9× faster on an HPC, but also reduced the peak RAM by around 6×. The source code of f5c is made available at https://github.com/hasindu2008/f5c.

Algorithm 17: Heuristic memory allocation scheme

allocate_gpu_arrays(X, Y)                ▷ pre-allocate GPU arrays REF, KCACHE, EVENTS, etc.
for each batch of n reads do
    ...                                  ▷ CPU processing steps before ABEA, e.g., event detection
    rs, es ← 0, 0                        ▷ cumulative sums of read lengths and numbers of events
    for each read i do
        if (rs + r[i] ≤ X and es + e[i] ≤ Y) then   ▷ check if GPU arrays have adequate free space
            p[i], q[i] ← rs, es          ▷ save the current read and event offsets
            rs ← rs + r[i]; es ← es + e[i]
            assign_to_gpu(i)             ▷ GPU arrays have space, so assign the read to the GPU
        else
            assign_to_cpu(i)             ▷ a GPU array is already full, so assign the read to the CPU
        end if
    end for
    serialise_ram_arrays(p, q, ...)      ▷ flatten multi-dimensional arrays in RAM to 1D arrays
    memcpy_ram_to_gpu(...)               ▷ copy the inputs of ABEA to the GPU memory
    gpu_alignment(p, q, ...)             ▷ perform ABEA on the GPU
    process_rest_on_cpu()                ▷ execute on the CPU in parallel with the GPU kernels
    memcpy_gpu_to_ram(...)               ▷ copy the alignment results back to the RAM
    deserialise(p, q, ...)               ▷ convert the 1D result arrays to multi-dimensional arrays
    ...                                  ▷ CPU processing steps after ABEA, e.g., HMM
end for
free_gpu_arrays()                        ▷ free GPU arrays REF, KCACHE, EVENTS, etc.

Note: changes relative to Algorithm 16 are highlighted in blue.
Chapter 6

System Integration

This chapter presents how the different optimisations proposed in the previous chapters are integrated to construct different prototype embedded systems that perform end-to-end DNA analysis workflows.

In collaboration with two other PhD candidates in the research group, an embedded system called SWARAM was constructed for performing a variant calling pipeline for second-generation sequencing. SWARAM consisted of 16 Odroid XU4 single board computers (SBC) interconnected through Ethernet. The optimisations to the Platypus variant caller presented in chapter 3 are applied in SWARAM to facilitate fast variant calling. However, the integration details and the architecture of SWARAM have been generously shared to be used in another PhD candidate's thesis and are thus not discussed or claimed in this thesis. Refer to the published article at [27] for those details.

Inspired by SWARAM, another embedded system called the nanopore-cluster was constructed, now with a different architecture to SWARAM, to process third-generation nanopore sequencing data. The optimisations proposed in chapters 4 and 5 were used in this nanopore-cluster to enable a real-time workflow for nanopore sequencing data. As mentioned in section 2.1.2.3, nanopore is a highly portable technology. Thus, an embedded system like the nanopore-cluster is harmonious with the ultimate goal of such ultra-portable sequencers: to enable complete DNA sequencing in-the-field. Further, unlike second-generation Illumina sequencers, third-generation nanopore sequencers allow streaming, and thus the processing can be performed while sequencing. This streaming capability can be exploited in an embedded system like the nanopore-cluster to perform data analysis on-the-fly while the sequencer is operating, with the intention of producing the result soon after the sequencing run is completed.

6.2. SYSTEM ARCHITECTURE OF NANOPORE-CLUSTER

The hardware architecture of the proposed system is shown in Fig. 6.1.
The system comprises the DNA sequencer and base-caller, Network Attached Storage (NAS) and the computational nodes (a head node and worker nodes), interconnected using Ethernet via a layer-2 switch. The system is interfaced with the Internet or the intranet through a router (layer-3 switch) supporting Network Address Translation (NAT). The functions and details of the individual components are elaborated below:

Sequencer and base-caller: The sequencer can be one of the available nanopore sequencers - MinION, GridION or PromethION (Fig. 2.10). If the sequencer is a MinION or a PromethION, it must be connected to the corresponding base-calling unit (the MinIT in the case of the MinION, or the compute tower in the case of the PromethION; see Fig. 2.11). It is this base-calling unit that is connected to the Ethernet switch through the available Ethernet interface on the base-calling unit. If the sequencer is a GridION or a MinION Mk1C, then the base-calling unit is integrated with the sequencer and has a direct Ethernet interface.

Figure 6.1: Hardware architecture of the proposed system

NAS: The NAS acts as the storage buffer between the base-caller and the computational nodes (head node and worker nodes). The sequencer and base-caller produce a batch of data every few minutes or so, which is copied to the NAS. The computational nodes fetch these batches from the NAS into their local storage and process the data. The NAS can also be used as an archive for the sequencing data in case the raw data is later required. The NAS is not necessarily dedicated NAS hardware; alternatively, it can be virtual, i.e., the secondary storage of the base-caller or the head node exposed as a network drive.

Head node: The head node can be an SBC, a laptop or even a desktop.
The head node monitors the NAS and assigns processing jobs to the worker nodes. The head node is also responsible for the administration of the worker nodes (controlling, updating software, deploying software). Note that the head node is not expected to be under high CPU load and is thus extremely unlikely to freeze.

Worker nodes: Worker nodes are SBCs. They process the data and are controlled by the head node.

NAT router: A router that supports NAT is not mandatory but is recommended. The Ethernet switch can indeed be connected directly to the local intranet or the Internet; however, administrators of centrally managed IT infrastructure may be reluctant to allow this due to the potential risk of switching loops. The NAT router streamlines the integration by hiding the Ethernet switch behind NAT. The NAT router (which usually comes with a built-in firewall) has additional benefits in terms of security. It can be configured to allow only limited inbound traffic (for instance, only to particular ports on the head node) while allowing outbound traffic for Internet access.

The NAS is mounted on the base-calling unit, the head node and all worker nodes. The sequencer outputs the reads as raw signal data that are acquired by the base-calling unit. When a batch of reads has accumulated, the base-caller performs base-calling and produces two files - a multi-FAST5 file containing the raw signals (formerly a directory containing one FAST5 file per read) and a FASTQ file containing the base-called reads. The batch size is 4000 reads by default, as set by ONT. When the base-calling of the batch is completed, the multi-FAST5 file (or, if single-FAST5, a tarball of the directory containing the single-FAST5 files) and the FASTQ file are copied to the NAS.

The head node monitors the NAS for recently copied data batches.
Once such a fresh data batch is found, the head node assigns the batch to a free worker node to be processed. If multiple worker nodes are free, the assignment is done randomly. If all worker nodes are busy, the batch is assigned as soon as a worker node becomes free.

A sequencing run lasts for 48 hours (MinION or GridION) or 64 hours (PromethION). The base-caller continuously produces batches from the read data produced by the sequencer. The head node continuously monitors the NAS and assigns the work to the worker nodes.

At the beginning of a sequencing run, all the pores in the flow-cell of the sequencer are functional and data batches are produced at a faster rate. With time, the pores slowly die and the rate at which data batches are produced decreases. The objective of the proposed architecture is to finish processing soon after the sequencing run completes. At the beginning of the sequencing run, when the sequencer outputs data faster, there is no strict need for the worker nodes to keep up with the rate at which the data is produced. Later, when the sequencing rate decreases, the worker nodes can catch up.

Using an embedded system for nanopore data processing facilitates portability and is potentially cheaper due to the low cost of SBCs compared to an expensive server. However, processing on such an embedded system is at the same time challenging due to the low reliability of those SBCs - i.e., they occasionally freeze when under high computational load, potentially due to a bug in the operating system (this was observed for the Rock64 devices used in our experimental setup). Most of the time, the freeze is detected by the watchdog timer and the device reboots automatically. In rare cases, the device completely freezes until it is manually power reset.

Dynamic workload and scalability. Sequencing yield differs between the MinION, GridION and PromethION. Library preparation techniques and the quality of the flow-cell also affect the sequencing yield.
The rate of the sequencing output also varies. The embedded system for processing should support such variations. Thus, the optimal number of SBCs required to process the data on-the-fly varies, and the system should be scalable so that SBCs can be added or removed based on the requirements and the budget. The scheduling of processing jobs should therefore dynamically scale with the available resources.

Flexibility to support evolving workflows. Nanopore bioinformatics workflows evolve rapidly. The changes can be as small as changing user-specified parameters of a program, moderate such as replacing a particular tool with another, considerable such as adding additional steps to the workflow, or significant such as using another workflow altogether. The embedded system for processing should be flexible enough to support these imminent changes in the future.

The above challenges were overcome by the proposed strategy called f5p, which is detailed below.

f5p - Lightweight Scheduler and Failure Handler

f5p is a lightweight job scheduler with integrated failure handling capability, designed to overcome the above-mentioned challenges in a nanopore data processing embedded system. f5p is composed of two components, namely f5pd and f5pl, which are explained below.

f5pd: f5pd is the daemon program that runs on the worker nodes. f5pd is launched at the startup of the worker node and runs indefinitely while listening on a port. f5pd accepts connections from the head node (f5pl below) and receives job scheduling commands from the head node. A job scheduling command takes the location of a shell script and the location of a data batch as its arguments. The shell script contains the commands for copying the data from the NAS to the local storage, executing the steps of the data processing workflow and copying the results back to the NAS. Once a job scheduling command is received, f5pd executes the job on the node, and once the job is completed, f5pd sends the status (success or failure with an error code) back to the head node.
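The execute-and-report step of f5pd just described might be sketched as follows. This is a simplified illustration, not the actual f5p source: run_job is a hypothetical name, and the real f5pd sends the status over its TCP connection rather than returning it:

```c
#include <stdio.h>
#include <stdlib.h>

/* Simplified sketch of f5pd's job execution (hypothetical, not the actual
   f5p source): build the command "<script> <batch>" from the two arguments
   of the job scheduling command and run it, reporting success or failure. */
int run_job(const char *script, const char *batch)
{
    char cmd[4096];
    /* the pipeline script receives the data batch location as its argument */
    snprintf(cmd, sizeof cmd, "%s %s", script, batch);
    int status = system(cmd);      /* runs copy-in, workflow steps, copy-out */
    return status == 0 ? 0 : -1;   /* 0 on success, -1 on failure */
}
```

Delegating the whole job to a user-supplied shell script is what gives f5p its flexibility: the scheduler never needs to know which tools the workflow runs.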
f5pd then continues to accept further job scheduling commands.

f5pl: f5pl is the launcher that is run by the user on the head node. f5pl accepts the pipeline shell script and configuration settings such as the directory path to be monitored and the IP addresses of the worker nodes to be used. First, f5pl establishes connections to the worker nodes and copies the pipeline shell script to all the nodes. Then, f5pl keeps monitoring the specified directory, and when a batch of reads is available, f5pl assigns the job to a free worker node. If the rate of data batches produced by the sequencer is high, f5pl will assign jobs until all worker nodes are occupied. Then, f5pl waits until a worker node becomes free and assigns the next waiting data batch accordingly. The process repeats until the sequencing run completes. If a worker node hangs (and is restarted by the watchdog timer), f5pl waits until the worker node is alive again and assigns the same data batch once more. If N consecutive freezes occur, the worker node is retired and is not used for the rest of the sequencing run. In the rare case where a device hangs completely (and is not restarted by the watchdog), f5pl retires the dead node and continues with the rest of the worker nodes.

f5p is thus capable of handling failures due to unreliable SBCs. Also, f5p is capable of dynamically assigning jobs based on the available worker nodes. As f5p accepts a shell script that can easily be customised by the user, it ensures flexibility. Administration tasks such as updating software, deployment of new software and configuration management of the worker nodes are done by existing IT automation software (e.g., Ansible).

6.3 Experimental Setup

The architecture proposed in section 6.2.1 was realised in the sequencing facility at the Garvan Institute of Medical Research using 16 Rock64 SBCs as worker nodes. The cluster of these 16 SBCs is referred to as the Rock64-cluster.
The Rock64-cluster placed alongside the nanopore sequencers is shown in Fig. 6.2.

Figure 6.2: Rock64-cluster placed alongside the nanopore sequencers at the Garvan Institute of Medical Research

Each Rock64 SBC (Fig. 6.3a), composed of a quad-core ARM processor, 4 GB of RAM and 64 GB of eMMC storage [203], was running Ubuntu 16.04 LTS as the operating system. The 16 SBCs were stacked using M2.5 copper cylinders (Fig. 6.3b) and were connected using Ethernet to an HPE OfficeConnect 1950 24G switch (Fig. 6.3c). A Synology DS3617xs system with 5 TB of storage was used as the NAS, and a Ubiquiti 10G SFP+ EdgeRouter Infinity was used as the NAT router. A desktop computer with an Intel i7-4790 processor and 16 GB of RAM running Ubuntu 16.04 was used as the head node. Refer to appendix E for a step-by-step guide on building the Rock64-cluster.

f5pd and f5pl, proposed in section 6.2.2.3, were implemented in the C programming language. TCP sockets were used for communication between f5pd and f5pl. Multiple worker nodes were handled in f5pl using multiple threads implemented with pthreads. f5pd was launched at the startup of each worker node, and its continuous running was ensured using systemd. The TCP keepalive feature in the Linux kernel [204] was used to detect hung worker nodes (by determining whether the connection is still up and running or has broken).

The Rock64-cluster was evaluated using a state-of-the-art nanopore methylation calling workflow (Fig. 6.4) consisting of the software tools Minimap2, Nanopolish and Samtools. The version of Minimap2 optimised for memory-capacity efficiency in chapter 4 is used on the Rock64-cluster, where the original Minimap2 cannot run due to the limited RAM on each node. Samtools was compiled for the ARM architecture. A modified version of Nanopolish was initially used on the Rock64-cluster, which was eventually replaced with the f5c developed in chapter 5.
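The hung-node detection mentioned in the implementation above relies on the Linux kernel's TCP keepalive probes on the f5pl-f5pd connection. A minimal sketch of enabling keepalive on a socket is shown below; the timing values are illustrative, not necessarily those used in f5p:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on a socket so a hung peer is detected when probes
   go unanswered. Returns 0 on success, -1 on error. The values below are
   illustrative: start probing after 60 s of idle time, probe every 10 s,
   and declare the peer dead after 5 unanswered probes (the TCP_KEEP*
   options are Linux-specific). */
int enable_keepalive(int fd)
{
    int yes = 1, idle = 60, intvl = 10, cnt = 5;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &yes, sizeof yes) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) < 0)
        return -1;
    return 0;
}
```

Once the probes are exhausted, subsequent operations on the socket fail, letting the launcher treat the worker as hung and reassign its data batch.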
The original Nanopolish did not compile on ARM due to Intel-specific SSE instructions; it had to be modified, and a bug that affected the ARM architecture had to be fixed, to run successfully on ARM. These fixes are now in the original Nanopolish repository; see appendix G. Later, Nanopolish was replaced with f5c for faster performance and memory efficiency. Despite the Rock64 devices not having a GPU, f5c was around twice as fast as Nanopolish.

Figure 6.3: Construction of the Rock64-cluster. (a) A newly opened Rock64 SBC; (b) Rock64 SBCs stacked using M2.5 cylinders; (c) Rock64 SBCs connected using Ethernet. Photograph credits: Martin A. Smith.

The aforementioned workflow was implemented as a shell script. This pipeline shell script runs on each worker node for each data batch, as mentioned in section 6.2.2.3. The shell script first copies the data from the NAS to the local eMMC storage (extracting it if it is a single-FAST5 tarball), performs the commands of the aforementioned pipeline in the order presented in Fig. 6.4, and finally copies the results back to the NAS.

Figure 6.4: Methylation calling workflow and its software tools (f5c/Nanopolish index, Minimap2 alignment, Samtools sort, Samtools index, f5c/Nanopolish methylation calling; the inputs are the reads in FASTQ and the raw signals in FAST5 from the sequencer and base-caller)

Ansible was used to automate administration tasks such as deploying software updates across the worker nodes, configuring settings across all worker nodes, performing maintenance operations, etc. Ansible, installed on the head node, accesses the worker nodes through password-less key-based SSH. The Ganglia monitoring system [205] was set up on the nodes to centrally observe the state and utilisation of the worker nodes from the head node (screenshot in Fig. 6.5). Also, rsyslog coupled with LogAnalyzer [206] was configured to centrally view the worker node logs from the head node (screenshot in Fig. 6.6).

The detailed steps to install and manage the Rock64-cluster are in appendix E, and the associated scripts are in the GitHub repository at https://github.com/hasindu2008/nanopore-cluster . The source code of f5p is available at the GitHub repository https://github.com/hasindu2008/f5p and more details are in appendix E.2.

The architecture presented in section 6.2.1 is not limited to Rock64 devices; the worker nodes can instead be any other SBC. Two other SBCs, namely the Jetson TX2 and the Jetson Nano from NVIDIA, were evaluated as alternatives to the Rock64. However, due to the prohibitive cost of building a complete cluster of Jetson SBCs, this evaluation was performed using only a single worker node, which is adequate as the multi-node architecture was already verified using the Rock64 SBCs.

The Jetson TX2 development board (Fig. 6.7a) is composed of a hexa-core ARM processor, 256 GPU cores, 8 GB of memory (RAM shared between the CPU and GPU) and 32 GB of integrated eMMC storage. A Samsung 1 TB SSD drive was connected to the Jetson TX2 development board using the SATA interface. The system was running Ubuntu 16.04.

The Jetson Nano development board (Fig. 6.7b) is composed of a quad-core ARM processor, 128 GPU cores and 4 GB of memory (RAM shared between the CPU and GPU). A SanDisk Extreme 64 GB microSD card (A2 rating) and a Samsung 512 GB external SSD USB drive were used as the storage. This system was running Ubuntu 18.04.

Figure 6.5: Screenshot from the Ganglia monitoring system

Figure 6.6: Screenshot from LogAnalyzer

The evaluation was performed using the same methylation calling workflow used for the Rock64-cluster in section 6.3.2, with the use of the CPU-GPU version of f5c instead of the CPU-only version being the only difference.

A nanopore MinION dataset of the T778 cancer cell-line of the human genome was used for the evaluations presented in section 6.4. This dataset contained 771,325 reads, with 11,393 and 194,983 bases as the average and maximum read lengths.
The total yield was 8.78 Gbases, and the total sizes of the FAST5 and FASTQ files were 845 GB and 17 GB, respectively. The FAST5 files were in the single-FAST5 format. The dataset consisted of 198 batches of reads, with each batch having 4000 reads.

Figure 6.7: NVIDIA Jetson development boards. (a) Jetson TX2 development board; (b) Jetson Nano development board. Photograph credits: Hsu-Kang Dow.

6.4 Results

For the aforementioned T778 MinION dataset, the complete methylation calling workflow consumed 5.88 hours on the Rock64-cluster. These 5.88 hours include the total processing time (the processing steps in Fig. 6.4) and all the overheads, i.e., the overheads due to scheduling, file transfer to/from the NAS, and tarball extraction. The tool used for methylation calling was f5c. Note that the time for the complete analysis on the Rock64-cluster (5.88 hours) is considerably less than the sequencing runtime on the MinION (typically 48 hours).

During the analysis time (5.88 hours) mentioned above, five occasions of worker node freezes were recorded (worker node freezes are explained in section 6.2.2.3). Four of the freezes resulted in watchdog timeouts and eventual automatic restarts. However, the integrity of the analysis was not affected, due to the failure handling mechanism of f5p that reassigned the data batch once the worker node became alive again (detailed in section 6.2.2.3). One freeze led to a totally dead device (not restarted by the watchdog); however, the analysis was continued with the remaining devices by f5p, as mentioned in section 6.2.2.3. Note that the 5.88-hour analysis time includes the time lost due to these freezes and dead devices.

The summary of the execution details discussed above for the T778 MinION dataset is listed in the first row of Table 6.1.
In Table 6.1, the first column describes the sample that was sequenced; the second column indicates the nanopore sequencer used (MinION, GridION or PromethION); the third column lists the number of data batches in the dataset; the fourth column gives the typical sequencing runtime (48 h for the MinION/GridION and 64 h for the PromethION); the fifth column indicates the software used for methylation calling (f5c or Nanopolish); the sixth column gives the total time for the execution of the methylation calling workflow on the Rock64-cluster; the seventh column gives the number of worker node freezes that were detected by the watchdog timer, leading to automatic restarts; and the eighth column gives the number of worker nodes retired due to complete freezes or consecutive failures.

While the above T778 dataset was used for performing thorough evaluations and benchmarks, several other nanopore datasets were also processed on the Rock64-cluster, and their execution details are summarised in Table 6.1 from the second row onwards. These details were collected while using the Rock64-cluster for in-house processing of research data samples. Some datasets in Table 6.1 were processed using Nanopolish instead of f5c because they were processed before the development of f5c, as mentioned in section 6.3. The last two rows of Table 6.1 are for the same dataset, where one execution used f5c and the other used Nanopolish. Observe that the performance of f5c is superior (45.08 hours for the complete workflow) compared to Nanopolish (61.58 hours for the complete workflow), in spite of 3 worker nodes being retired in the f5c execution compared to only 2 retired worker nodes in the Nanopolish execution.
Therefore, the processing times observed for the datasets processed using Nanopolish would improve if they were executed using f5c.

Table 6.1: Execution results of several nanopore datasets of the human genome on the Rock64-cluster

Sample              Sequencer  Data batches  Sequencing run time (h)  f5c or Nanopolish  Processing time (h)  Node resets  Node retires
T778 - liposarcoma  MinION     198           48                       f5c                5.88                 4            1
…                   …          …             …                        Nanopolish         …                    …            …
…                   …          …             …                        Nanopolish         …                    …            …
…                   …          …             …                        f5c                …                    …            …
…                   …          …             …                        Nanopolish         …                    …            …
…                   …          …             …                        Nanopolish         …                    …            …
…                   …          …             …                        Nanopolish         …                    …            …
…                   …          …             …                        Nanopolish         …                    …            …
…                   …          …             …                        Nanopolish         …                    …            …
…                   …          …             …                        Nanopolish         …                    …            …
…                   …          …             …                        f5c                45.08                …            3
…                   …          …             …                        Nanopolish         61.58                …            2

The architecture proposed in section 6.2 is generic, i.e., the worker nodes are not limited to Rock64 SBCs. Two alternative SBCs to the Rock64 were benchmarked, namely the Jetson TX2 and the Jetson Nano. Both Jetson SBCs are from NVIDIA and are equipped with GPUs, making it possible to fully harness the GPU-accelerated component of f5c. The benchmarking was performed on only a single Jetson TX2 and a single Jetson Nano due to the prohibitive costs. The performance of the methylation calling workflow on each of these SBCs is compared to the performance on a single Rock64 SBC in Fig. 6.8.

The x-axis of the horizontal bar chart in Fig. 6.8 denotes the time in hours. The bars denote the sum of the execution times for all 198 batches of the T778 MinION dataset, and the different colours denote the breakdown of the execution time for each step of the workflow. The Jetson TX2 was the fastest, consuming 12.27 hours, followed by the Jetson Nano, consuming 31.44 hours. The Rock64 was the slowest, consuming 60.31 hours. Thus, a single Jetson TX2 was around 5 times faster than a single Rock64 SBC, and a single Jetson Nano was around 2 times faster than a single Rock64 SBC.

On each SBC, the major portion of the time was contributed by the Minimap2 alignment step (49%-60%), followed by the methylation calling step (33%-43%). Methylation calling was performed using f5c - the CPU-only version on the Rock64 and the CPU-GPU version on the Jetson SBCs.
The f5c index step contributed less than 7% of the total time, and the times for Samtools sort and Samtools index were very small (around 1%).

Figure 6.8: Comparison of the Jetson TX2, Jetson Nano and Rock64 based on the single-SBC execution times for the whole dataset

Fig. 6.9 shows the workflow execution time spent on each of the 198 data batches of the T778 dataset. Fig. 6.9a is for a single Rock64, Fig. 6.9b is for a single Jetson TX2 and Fig. 6.9c is for a single Jetson Nano. The x-axis of each bar chart denotes the data batch number, ranging from 1 to 198. The y-axis is the time in minutes, where the bars represent the time spent on each data batch for the methylation calling workflow. The different colours in the bars denote the breakdown of the time for the different steps of the workflow. The average time for processing a data batch was 18.35 minutes for the Rock64 SBC, 3.72 minutes for the Jetson TX2 and 6.39 minutes for the Jetson Nano. Out of the 18.35 minutes for the Rock64, the major portion was consumed by the Minimap2 alignment (8.95 minutes) followed by f5c methylation calling (7.90 minutes). Similarly, 1.89 and 1.56 minutes for the Jetson TX2, and 5.73 and 3.14 minutes for the Jetson Nano, were recorded for Minimap2 and f5c, respectively.

Figure 6.9: Execution time on individual SBCs for each batch in the dataset. (a) On a single Rock64 development board; (b) on a single Jetson TX2 development board; (c) on a single Jetson Nano development board

As mentioned in section 6.1, nanopore sequencers are capable of streaming the sequencing data, and thus it is possible to process the data on-the-fly. This section demonstrates a proof of concept of performing data analysis on-the-fly (in real-time) using the architecture presented in section 6.2.

How the sequencing rate varies over time is shown in Fig. 6.10 for the MinION (blue curve), GridION (orange curve) and PromethION (yellow curve). The x-axis denotes the time in hours and the y-axis denotes the cumulative number of bases sequenced (in Gbases) over time. Observe how the sequencing rate (the gradient of the curve) is high at the beginning, then slowly reduces and eventually becomes zero.

Fig. 6.10 also plots the cumulative number of bases that can be processed using a single Rock64 SBC (purple dashed line), a single Jetson TX2 (blue dashed line) and a single Jetson Nano (green dashed line). For these plots, the y-axis is now the number of gigabases analysed. The gradient of each plot is calculated by dividing the total number of bases in the dataset by the total workflow execution time in Fig. 6.8 for the corresponding SBC. Observe that a single Rock64 device is barely adequate to keep up with the MinION curve. At first, the analysis lags while the sequencing rate is high; it then catches up when the sequencing rate drops. Observe also that a single Jetson Nano can easily keep up with a MinION, and a Jetson TX2 is barely adequate to keep up with the GridION curve. Fig.
6.10 also plots the cumulative number of bases that can be processed using clusters made of each SBC, i.e., 16 Rock64 devices (purple dotted line), 4 Jetson TX2s (green dotted line) and 8 Jetson Nanos (blue dotted line). The gradients of these lines are equal to the product of the gradient for a single SBC and the number of devices. The number of SBCs in each cluster has been selected so that the cluster can more than adequately keep up with a PromethION flow-cell. Such an extra margin between the analysis and sequencing yield curves smooths on-the-fly processing while allowing for disruptions such as device freezes.

Note that the x-axis in Fig. 6.10 shows only the first 45 hours of the sequencing run, as the curves are almost flat by this time, despite the sequencing run of a MinION/GridION being 48 hours and that of a PromethION being 64 hours. Also, the curves for the sequencers in Fig. 6.10 are based on typical average values. The exact curve can vary based on factors such as the quality of the sample and the flow-cell, and will also change with technology improvements. Nevertheless, the presented proof-of-concept technique for estimating the number of SBCs required for analysing the data on-the-fly remains unaffected.

Figure 6.10: Comparison of the sequencing rate with the data analysis rate over the duration of the sequencing run

The performance of the Rock64-cluster is compared to the performance of an HPC in Fig. 6.11. The Rock64-cluster performed the methylation calling pipeline using f5p, as explained in section 6.3. The HPC was a server with 28 Xeon E5-2680 cores, 512 GB of RAM and 10 NVMe SSD drives in a RAID configuration. The HPC executed the same methylation calling workflow as in Fig. 6.4 with the original Minimap2 and the original Nanopolish.
Observe that the time spent on the Rock64-cluster (5.88 hours) is comparable to the time consumed on the HPC (4.81 hours). Importantly, the time for the Rock64-cluster includes the overheads of copying to/from the NAS and extracting tarballs, whereas the HPC processed a dataset that was already placed on its fast local SSD RAID drives.

Considering that the cost and size of the server are both approximately 10 times those of the Rock64-cluster, it is surprising that the performance is similar. Further analysis of this surprising phenomenon is in chapter 7.

Figure 6.11: Comparison of the proposed architecture on the Rock64-cluster with the original pipeline running on an HPC

6.5 Discussion

The evaluation results presented in this chapter were based on datasets that were already residing on the NAS (datasets of previously sequenced samples), i.e., the whole dataset (all data batches) was available on the NAS when the workflow execution was started on the Rock64-cluster. While this is adequate to demonstrate the proof of concept for on-the-fly (real-time) data processing, future work could focus on implementing scripts that automate the data transfer in real-time from the sequencer to the NAS. In fact, this implementation work is currently in progress as an undergraduate student project ( https://github.com/sashajenner/realf5p ) and is not claimed as a part of this thesis.

The architecture proposed in this chapter is also applicable to a cluster of mobile phones connected through Wi-Fi. The feasibility of performing the methylation calling workflow on an Android mobile phone was evaluated in an experimental environment (Fig. 6.12), as described in appendix F. The development of a proper Android application was undertaken by an undergraduate project group and is described in the pre-print at [28].
Also, the Wi-Fi cluster implementation is in progress by the same group. The development of the Android application and the Wi-Fi cluster is not claimed under this thesis.

Figure 6.12: Methylation calling workflow on an Android mobile phone

The proposed architecture, which realises portable real-time nanopore-based methylation detection systems, has potential applications such as tissue classification, diagnostic tests, environment, age, etc. The basis for such applications is in Fig. 6.13, which shows how the methylation frequency changes with the number of bases sequenced for five different loci on the human genome (the genes TP53, MGMT, BRCA1 and BRCA2, and chromosome 22). The number of bases sequenced (x-axis) is indicative of the time. As observed in the left plot of Fig. 6.13, most of the CpG sites in each genomic locus are covered after around 2 gigabases of sequencing data (the gradient of the curves decreases). The right plot of Fig. 6.13 shows that the methylation frequency across the various loci stabilises after around 2 Gbases of sequencing data.

Figure 6.13: Potential applications of real-time methylation calling - the variation of the number of called sites (left graph) and the methylation frequency (right graph) over the number of gigabases sequenced

6.6 Summary

A system architecture was proposed for performing a popular DNA methylation detection workflow on a prototype embedded system. The workflow was realised on the proposed architecture by integrating the optimised software versions from the previous chapters.
The proposed architecture was evaluated using off-the-shelf single-board computers, and it was demonstrated that real-time analysis of nanopore sequencing data is possible on an embedded system. It was also shown that the performance of the prototype embedded system is surprisingly similar to the performance on an HPC. The prototype system is fully functional and is integrated into the nanopore sequencing facility at the Garvan Institute of Medical Research for performing methylation calling of samples. The system architecture and the associated software for building a replica of the prototype are open-sourced at https://github.com/hasindu2008/nanopore-cluster and https://github.com/hasindu2008/f5p .

Chapter 7
Optimisation of Nanopore Sequence Analysis for Many-core CPUs

This chapter is prepared to be submitted as a publication to an ACM/IEEE journal/conference: H. Gamaarachchi, H. Saadat, S. Parameswaran, "Optimisation of Nanopore Sequence Analysis Software for Many-core CPUs", to be submitted [in progress], 2020.

Nanopore sequencing is a third-generation (the latest) genome sequencing technology. These modern advances in computational genomics are reshaping healthcare through life-saving applications in medicine and epidemiology, where a quick turn-around time of results is critical. Nanopore sequence analysis software tools are inefficient in utilising the computing power offered by modern High Performance Computing systems equipped with many-core CPUs and RAID systems. In this chapter, we present a systematic experimental analysis to identify the potential bottlenecks, which reveals that the primary bottleneck is the thread-inefficient HDF5 library used to load nanopore data. We propose multiple optimisation strategies suitable
for different practical scenarios to alleviate the bottleneck: 1) a new file format that offers up to ∼ × I/O performance improvement; and 2) a multi-process based solution, for the scenario when using a new file format is not possible, that offers up to ∼ × I/O performance improvement.

We demonstrate the efficacy of our optimisations by integrating them into the popular Nanopolish toolkit. Our experiments using a representative nanopore dataset demonstrate that the proposed optimisations enable improved scaling of overall performance with the number of threads (∼ . × for 4 vs. 32 threads). Moreover, they also lead to overall performance improvement (∼ × for 4 threads and ∼ . × for 32 threads) and improved CPU utilisation (from 69% to 99% for 4 threads and from 22% to 85% for 32 threads) for a given number of threads, when compared to the original Nanopolish.

7.1 Introduction

Computational genomics has turned a new chapter in medical sciences and epidemiology [207, 208]. It enables promising applications such as accurate disease diagnosis, identifying genetic predisposition, and precision medicine [209]. Genome sequencing converts the genetic and biological information encoded in DNA molecules into computer-readable data, which is typically hundreds or thousands of gigabytes in size. Nanopore sequencing is a leading third-generation (the latest) genome sequencing technology [189]. Computational genomics software tools analyse the huge amount of data generated by genome sequencing to extract meaningful information for the above applications.

Quick turn-around time of results in such applications is highly desirable. For instance, quick diagnostics can instantiate immediate treatments. Moreover, rapid results would enable faster tracking of disease spread in epidemiological applications such as the ongoing coronavirus outbreak [210]. However, to analyse the enormous amount of data with high speed, genomic
computation software tools demand massive computing time. Thus, scientists typically use High Performance Computing (HPC) systems to run these software tools [211].

A modern HPC system offers significant computational power through many-core CPUs that is to be exploited through parallelism. The major advantage of such systems when compared to an ordinary personal computer is the availability of a large number of cores in the CPUs. Moreover, HPC systems have RAID storage composed of many disks for higher I/O throughput, with the added benefit of reliability [212]. Unfortunately, the existing software tools for nanopore sequencing are generally not capable of efficiently utilising the large number of cores available in many-core HPC systems, and thus fail to take maximal advantage of the available computing power. Consequently, the overall execution time of the applications on an expensive HPC system may not improve significantly when compared to their execution on a less expensive workstation or a personal computer (refer to Chapter 6). In this chapter, we present software optimisations for nanopore software tools to enable them to take maximal advantage of the computing power offered by modern many-core HPC systems.

To demonstrate the problem mentioned above, we present a motivational example using Nanopolish [104], which is a popular state-of-the-art nanopore raw data analysis toolkit [213].

Motivational Example: We executed the call-methylation tool in the Nanopolish toolkit on a representative dataset. The experiment was performed on a high-end HPC system with 36 Intel Xeon cores using different numbers of threads. The graph in Fig. 7.1a plots the execution time (y-axis) for Nanopolish against the number of threads (x-axis). We observe that when the tool is executed with four threads, the execution time is nearly 10 hours. The execution time does not improve significantly with an increasing number of threads, and there is little improvement beyond 16 threads.
(Footnotes: Refer to Appendix H for another example on another dataset. See the experimental setup under the results for details of the dataset. See system S1 in Table 7.4 for the specification of the HPC system.)

[Figure 7.1: Variation of (a) execution time and (b) CPU utilisation and core-hours in the original Nanopolish with the number of data processing threads.]

To analyse further, Fig. 7.1b plots the CPU utilisation (left y-axis) and the core-hours (right y-axis) for each case in the above experiment. The CPU utilisation for execution with four threads is less than ideal (69%). Moreover, as the number of threads increases, the CPU utilisation decreases significantly. Specifically, when executed with 32 threads, the CPU utilisation is as low as 22%. We also observe that the core-hours (which should be constant with the number of threads in an ideal case) increase significantly, depicting that employing a greater number of threads is inefficient and not highly beneficial.

(Footnotes: CPU utilisation is calculated as described in the results. Core-hours is inspired by the common term man-hour; it is equal to the product of the number of hours and the number of cores/threads [214].)

Thus, procuring an HPC system with a higher number of CPU cores might not be beneficial for achieving a quick turn-around time of results for nanopore software tools, and there is a need for software optimisations in nanopore software tools to exploit the available resources in HPC systems. To this end, in this chapter, we first present a systematic experimental analysis to identify the potential bottlenecks that hinder the efficient utilisation of CPU resources in nanopore software tools. Then we present multiple optimisations, suitable for different practical scenarios, to overcome these bottlenecks and enable performance improvements. Our experiments using the state-of-the-art Nanopolish toolkit on HPC systems demonstrate that our proposed optimisations enable improved CPU utilisation and hence improved performance scaling with the number of threads (Fig. 7.9 and 7.11). Moreover, they also enable improved performance for a given number of threads with respect to the original Nanopolish. For example, for 32 threads, the CPU utilisation increases up to ∼85% (which was 22%), and a ∼ . × speed-up is achieved when compared to the original Nanopolish. We believe that such improved performance will facilitate fast diagnostics and rapid epidemic response.

Contributions: The key novel contributions of this chapter can be summarised as follows.

• We, for the first time, present a systematic analysis to identify the potential bottlenecks in nanopore software tools. The analysis reveals that the primary bottleneck is caused by a limitation in an underlying library (HDF5) that serialises disk accesses from multiple threads (Section 7.3).

• We propose an alternate file format (SLOW5) that alleviates the bottleneck by allowing random accesses from multiple parallel threads. The proposed file format is designed by exploiting the domain knowledge of nanopore sequencing (Section 7.4.1).

• In some scenarios, it may not be practically possible to use a new file format. Therefore, we present a second solution based on multiple processes. This solution alleviates the bottleneck without requiring any modification to the existing file format (Section 7.4.2).

• We demonstrate that the new multi-FAST5 file format, which is projected as a replacement for the existing FAST5 file format by the research community, also suffers from the same bottleneck, and thus our proposed SLOW5 format is superior. Moreover, our multi-process based solution is also effective in alleviating the bottleneck in multi-FAST5 (Section 7.5.5).
Chapter Organisation: Section 7.2 discusses the background and related work. Section 7.3 elaborates our analysis for identifying the bottleneck and its explanation. Our proposed optimisations and solutions are presented in Section 7.4. Section 7.5 presents our experimental setup and results. Finally, Section 7.6 is the discussion, and the chapter is concluded in Section 7.7.

7.2 Background

Two types of I/O, synchronous I/O (blocking I/O) and asynchronous I/O (non-blocking I/O), in the context of random accesses (as opposed to sequential/streaming accesses), are discussed in Sections 7.2.1.1 and 7.2.1.2, respectively.

Synchronous I/O is convenient to program, and such code is legible. Thus, synchronous I/O is the most popular and is predominantly used amongst typical programmers. The following is a simplified account of how random disk requests are served in a modern operating system.

Consider a single-threaded program that requests I/O using standard read or write system calls (buffered read/write API calls such as fwrite, fread, fprintf, getline, etc. are eventually mapped to these system calls). These system calls are synchronous calls which return when the requested data is read from the disk.

In Fig. 7.2, the user-space thread is performing a synchronous I/O request. The operating system receives the system call and queues the disk request in its disk request queue. Momentarily, the user-space thread is put to sleep by the operating system, since a disk request is expected to take hundreds of thousands of CPU clock cycles. The operating system will
schedule the disk request (assign it to the disk controller) based on the policies and priority levels imposed. The disk controller will perform the operation, and the operating system will wake up the thread once the requested data reading is completed.

[Figure 7.2: Elaboration of synchronous I/O. Time to serve a single disk request = t; total number of I/O requests = n; total time spent on I/O = tn. The disk system can be an HDD or SSD, and a single disk or a RAID array.]

If the disk system has a single disk, effectively one request can be served at a time (as the discussion is about random accesses, disk request merge operations are infrequent). If the disk system has K disks, up to K requests may be served simultaneously, depending on the RAID level; i.e., K simultaneous parallel reads are possible on a RAID 0 system with K disks.

Consider a program with a single thread requesting I/O as shown in Fig. 7.2. Let t be the average disk request service time (from the time of the system call to when the thread is woken up). Let a single thread be requesting n synchronous disk reads sequentially. Despite the number of requests n, the total time spent on disk reading T is: T = t × n.

Now consider a program with multiple threads requesting I/O (I/O threads) as shown in Fig. 7.3. One thread put to sleep due to an I/O request does not limit other threads from requesting I/O. Thus, if we launch K I/O threads and if the disk controller can serve K requests in parallel, the total time for disk reading is: T = t × n / K.

The scenario in Fig. 7.3 is achieved by programs where threads have an independent code path, i.e., where each processing thread independently performs disk accesses on demand. However, in programs that perform data processing batch by batch, where one single thread
reads a batch of data from the disk and assigns it to multiple processing threads to be processed in parallel, it is the scenario in Fig. 7.2.

[Figure 7.3: Elaboration of multi-threaded synchronous I/O. Time to serve a single disk request = t; total number of I/O requests = n; number of I/O threads = K; total time spent on I/O = tn/K.]

Asynchronous I/O is pertinent to highly responsive applications like web servers and database servers. In asynchronous I/O, the system call that requests I/O will return immediately. The user-space thread will not be put to sleep, and thus it can continue to submit another I/O request or execute some other task while the disk request is being served.

Consider the asynchronous I/O example in Fig. 7.4, where a single thread submits multiple I/O requests to the operating system simultaneously. In Fig. 7.4, the single user-space thread submits K I/O requests in parallel. Then the thread can either poll or wait for a notification from the operating system for I/O request completion. Assume we have n total disk requests to be performed. If the disk system can perform K accesses in parallel and if the time for a single disk access is t (K parallel accesses take t as well), the total time is T = t × n / K. Note that this time is the same as for Fig. 7.3.

[Figure 7.4: Elaboration of asynchronous I/O. Time to serve K parallel disk requests = t′; total number of I/O requests = n; total time spent on I/O = t′n/K.]

This type of asynchronous I/O is suitable when a program reads data and processes it batch by batch, where one thread performs I/O and then assigns the data to multiple threads or to an accelerator card (e.g., a GPU) to be processed. This is in contrast to the independently processing threads we discussed under synchronous I/O above, as the threads need to converge in this case.

Asynchronous I/O can be performed by: (1) native asynchronous I/O system calls in the operating system; or (2) a library that emulates asynchronous I/O through a thread pool that uses synchronous I/O system calls in the operating system.

From the two methods above, (1) allows ‘real’ asynchronous I/O, but only if supported by the operating system. Early Linux kernels (before version 2.5) did not have native asynchronous I/O system calls; however, they are available in modern Linux kernels from version 2.5 onwards [215]. Despite that, the asynchronous I/O implementation in the Linux kernel has been a controversial topic amongst Linux developers [216], is complicated [215], has various drawbacks [215, 217] and does not support certain file systems such as NFS [218]. The GNU C Library (glibc) does not provide wrapper functions for the asynchronous I/O system calls [219]. Instead, programmers must use low-level system calls, which are not easy to use and are non-portable (architecture specific). Third-party libraries such as libaio [220], which uses the Linux native asynchronous I/O system calls, have attempted to provide an abstract interface.

An example of method (2) above is the current Portable Operating System Interface (POSIX) compatible asynchronous I/O (AIO) library provided by glibc. The POSIX AIO implementation in glibc is provided in user-space and uses multiple threads [221]. The developers of POSIX AIO have admitted that their approach is expensive and has scalability issues, which are expected to be fixed in the future through a state-machine-based implementation of asynchronous I/O [221]. Further, POSIX AIO is not implemented in all systems (e.g., Windows Subsystem for Linux).

While POSIX AIO is suitable for typical I/O loads, programmers can also spawn multiple I/O threads per batch and assign the disk accesses amongst them. This is suitable if the batch size is big and the thread spawning time is small compared to the I/O time of the batch.

Genomics: DNA is a molecule composed of a long strand of millions of units called nucleotide bases (or simply called bases). Genome sequencers read a DNA strand in relatively smaller fragments (around 10,000 bases long in nanopore) and convert them into digitised data, termed reads [189] in the domain of bioinformatics. In this chapter, we refer to them explicitly as genomic reads to avoid confusion with disk reads.

Nanopore Sequence Analysis: Nanopore sequencing is a leading third-generation (the latest) sequencing technology [189]. Oxford Nanopore Technologies (ONT) is the company that produces nanopore sequencers. A nanopore sequencer is composed of an array of pico-ampere range current sensors that measure the ionic current disruptions when DNA fragments pass through nanometre-scale protein pores [189, 222]. The raw sensor output for a genomic read is a time-series current signal and is referred to as the raw signal or raw data. ONT stores the raw signal and other metadata (e.g., sampling frequency) in a file format called FAST5 [223]. FAST5 is essentially a Hierarchical Data Format 5 (HDF5) [95] file with a specific schema defined by ONT. The only existing library for accessing the HDF5 format is the official library
developed and maintained by the non-profit organisation HDF Group [224, 225].

Nanopolish: Raw data analysis toolkits analyse the sequencer outputs using complex algorithms and extract meaningful information. Nanopolish is currently a popular state-of-the-art nanopore raw data analysis toolkit. Nanopolish is used in a number of genomic workflows such as methylation detection [104], variant detection [226], draft genome polishing [56, 103] and real-time molecular epidemiology for the ongoing coronavirus outbreak [210]. Nanopolish is written in C/C++ and supports multi-threaded execution through OpenMP. It is an open-source toolkit with a large and complex codebase [104, 227].

Previous Work on Optimising Nanopolish: Nanopore sequence analysis is a relatively new field that only emerged in the last decade. Thus, optimisation efforts to improve the performance of nanopore software tools are rare. In Nanopolish, the calculation of the log likelihood ratio is a predominantly used CPU-intensive computation kernel [104]. To reduce the CPU time for log likelihood computation, the Nanopolish authors have already employed a fast table-driven log-sum implementation established elsewhere [228]. However, none of the existing works have focused on improving the overall performance of Nanopolish on HPC systems with many cores and Redundant Array of Independent Disks (RAID). Our proposed optimisations are orthogonal to the methods discussed above.

Previous Work on Optimising Sequence Analysis: Several optimisation efforts exist for second-generation sequencing software (also known as Next Generation Sequencing) [24, 229–233]. However, the software used for nanopore sequencing (third generation) is distinct from the first and second generations [234].
Nanopore technology involves processing raw signal data, which is not the case for the first and second generations.

In this chapter, we, for the first time, identify the major causes behind the inefficient resource utilisation by nanopore software tools and present multiple optimisations to alleviate those issues.

7.3 Identification and Explanation of the Bottleneck

The motivational example in Section 7.1 revealed that Nanopolish is unable to efficiently utilise multiple cores in the system. There can be two reasons for an application being unable to utilise parallel resources: 1) a data processing bottleneck; and/or 2) an I/O bottleneck. In this section, we identify and explain that the primary reason for the under-utilisation is an I/O bottleneck.

We employed performance monitoring and profiling tools in the motivational example setup to hypothesise the causes of inefficient resource utilisation and performance.

Hypothesis-1: The performance of the software tool is bounded by file I/O. We observed through the htop utility in Linux that the majority of Nanopolish threads are in the ‘D’ state. The ‘D’ state is defined as the ‘state of the process for disk sleep (uninterruptible)’ [235]. This leads to our first hypothesis that the software tool is bounded by file I/O. In fact, Nanopolish incurs a large number of random disk accesses when reading the millions of FAST5 files (based on HDF5) in a nanopore dataset.

Hypothesis-2: The file I/O bottleneck is caused by the HDF5 library and not by the limitation of the physical disks. We observed the disk usage statistics through the iostat utility to find that the disk system is not fully utilised (i.e., the observed number of disk reads per second was around 100, while the particular disk system could handle more than 1000 IOPS).
This implies that the I/O bottleneck is not due to a limitation of the physical disks in serving data fast enough to saturate the processor. (Footnote: A nanopore dataset of a single genome sample contains millions of genomic reads (fragments of DNA), and each genomic read is stored in a separate FAST5 file. Thus, accessing millions of such genomic reads incurs millions of random disk accesses, as opposed to sequential/streaming accesses.) To investigate further, we profiled Nanopolish with Intel VTune under concurrency profiling. It reveals that the majority of the ‘wait time’ is due to a conditional variable (a synchronisation primitive) in the underlying HDF5 library (Hierarchical Data Format 5, used to access the raw nanopore data stored in FAST5 files). A closer look into the HDF5 library revealed that the thread-safe version of the HDF5 library serialises the calls for disk read requests [236]. Thus, we hypothesise that the reduced CPU utilisation is caused by the disk requests being serialised by the HDF5 library, consequently causing the bottleneck and limiting the utility of a multi-disk RAID system.

To verify the above identified cause of the bottleneck, we performed a deeper analysis. For this, we first restructured Nanopolish such that the wall-clock time spent on I/O operations and on data processing can be separately measured, to determine the time spent on individual components of the program.

[Figure 7.5: Decomposition of time for individual components (BAM access, FASTA access, FAST5 access and data processing) in the restructured Nanopolish, against the number of threads.]

We ran the restructured Nanopolish with various numbers of threads (for FAST5 access and data processing). The results are presented in Fig. 7.5. The x-axis in the figure represents the number of threads used and the y-axis represents the total execution time.
The different colours in the bars (see legend) denote the decomposition of the total execution time into different components. (Footnote: FASTA access refers to random access to the reference genome (stored in the FASTA file format) performed using the faidx component of the htslib library. BAM access refers to sequential access performed through the htslib library to the genomic alignment records (stored in the BAM file format).)

We observe from Fig. 7.5 that: 1) the contribution of the BAM access and FASTA access to the overall execution time is negligible; 2) a major portion of the time is consumed by FAST5 access (patterned brown); 3) the time consumed by FAST5 access (patterned brown) does not improve with an increasing number of threads; and, 4) the data processing time (solid blue) improves with an increasing number of threads. This confirms that the bottleneck is caused by file I/O and not by any data processing bottlenecks.

In this subsection, we explain the major limitation in the HDF5 library that prevents efficient parallel accesses, consequently causing the bottleneck.

HDF5 Library and its Limitations in Thread Efficiency: The HDF5 library uses synchronous I/O calls, and even the latest HDF5 implementation (HDF5-1.10) does not support asynchronous I/O. (Footnote: In synchronous I/O calls, the OS, upon receiving the call, puts the user-space thread to sleep, and the thread can no longer submit I/O requests until the disk reading is completed and the thread is woken by the OS. Conversely, asynchronous I/O system calls return immediately without the thread being put to sleep, and the thread can continue to submit further asynchronous requests.) This, by itself, is not an issue, as multiple synchronous I/O operations can be performed in parallel using multiple I/O threads to exploit the high throughput of RAID systems in HPC systems. Fig. 7.3 demonstrates how multiple I/O threads can be used to perform parallel disk accesses using synchronous I/O.
Suppose the disk system has K disks: up to K requests may be served simultaneously, depending on the RAID level; i.e., K simultaneous parallel reads are possible on a RAID 0 system with K disks. Let t be the average disk request service time (from the time of the system call to when the thread is woken up). For a program that launches K I/O threads, if the disk controller can serve K requests in parallel, the total time for n disk reads is T = t × n / K.

However, the HDF Group (which maintains the HDF5 library) mentions that the thread-safe version of the HDF5 library is not thread efficient and that it effectively serialises the calls for disk read requests [236]. The global lock in the thread-safe version of the HDF5 library creates limitations. The following is an extract from the FAQ section of the HDF web site [236].

“Users are often surprised to learn that (1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library. Although each thread in these examples is accessing different data, the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on a semaphore around the library API calls in the thread-safe version of the library to protect the data structure from corruption by simultaneous manipulation from different threads. Examples of HDF5 library global data structures that must be protected are the freespace manager and open file lists.”

Thus, in spite of having multiple I/O threads, I/O requests for HDF5 files have to go through the HDF5 library. Fig. 7.6 demonstrates this, where K I/O threads are requesting I/O from the HDF5 library in parallel. However, the lock inside HDF5 serialises the parallel requests, effectively issuing only one request at a time to the operating system disk request queue. The operating system will put the thread to sleep, and this is equivalent to having a single I/O thread.
Thus, the total time spent on disk accesses will be T = t × n, and essentially, the high-throughput capability of multiple disks in a RAID configuration is under-utilised.

[Figure 7.6: Elaboration of the limitation in the HDF5 library: multiple threads in a single process request I/O in parallel, but the HDF5 library issues only one I/O request at a time to the operating system disk request queue.]

7.4 Proposed Optimisations

In this section, we present two types of solutions to overcome the bottleneck in nanopore software tools. The first approach is to use an alternate file format (Section 7.4.1). However, current nanopore software tools have been developed on top of the FAST5 (HDF5) format because of its adoption by Oxford Nanopore Technologies as the file format for storing the raw signal. Thus, using a new file format may not always be practical. For such scenarios, we present a second solution that uses multiple processes instead of multiple threads for I/O operations (Section 7.4.2). This second solution does not require any changes to the FAST5 format or the HDF5 library. Furthermore, we also present a few more optimisations to Nanopolish that enable further speed-up (Section 7.4.3).

We propose a new file format called SLOW5 for storing nanopore raw signal data as an alternate to FAST5. (Footnote: The name SLOW5 is ironic with respect to FAST5.) We considered the domain knowledge from nanopore sequence analysis and the characteristics of disk accesses to design the new file format.

SLOW5 File Format: We designed our proposed SLOW5 file format by extending the simple and well-known tab-separated values (TSV) format, taking inspiration from gold-standard genomic file formats such as SAM [57] and VCF [237]. An example of the proposed file format is shown in Table 7.1. The structure of the file is explained below.

The first set of lines of the SLOW5 file comprises the file header.
Each header line starts with a designated prefix character and contains information such as the sequencing flow-cell identifier, the sequencing run identifier, etc. The last line in the header gives the column names of the upcoming data, which are tab-separated. Note that not all metadata and data fields are depicted in Table 7.1, for the sake of brevity.

The header is followed by the data, where each line (row) represents a single genomic read. In other words, for N genomic reads, there are N data lines in the file. The genomic read information fields (e.g., the read-identifier, the number of signal samples, and the raw signal) are tab-separated and are in the same order as defined in the last line of the header. The raw_signal column contains the current signal values separated by commas. Note that all data corresponding to a single genomic read are placed contiguously in the same row, thus facilitating locality in disk accesses.

Working Explanation: Random accesses to the genomic read records in a SLOW5 file are facilitated by an index called the SLOW5 index. The SLOW5 index is another tab-separated file, as shown in Table 7.2. Each line (except the first header line) corresponds to a genomic read. The first column is the read-identifier of the genomic read, the second column is the file offset to the corresponding SLOW5 record, and the third column is the size of the corresponding SLOW5 record in bytes (including the newline character). For performing random disk accesses to SLOW5, the SLOW5 index is first loaded into a hash table in RAM, where the read-identifier serves as the hash table key and the rest of the data is used as the hash table value. For a given read-identifier, the file offset and the record length are obtained from this hash table, and the program can move the file pointer to the offset (i.e., using lseek) and
load the record into memory. Multiple random accesses to SLOW5 can be performed in parallel, either through synchronous I/O calls with multiple threads, or through asynchronous I/O if supported by the operating system. Note that the raw signal data is read-only during nanopore sequence analysis. Thus, SLOW5 is inherently thread-safe without any need for global locks.

The original Nanopolish runs a single process with multiple threads. We propose a multi-process based solution for scenarios where the existing FAST5 cannot be replaced. Multiple threads in a single process share the same address space, and thus the lock in the HDF5 library affects all the threads. Multiple threads are typically used to run sub-tasks in parallel while conveniently sharing data amongst the threads. In contrast, multiple processes have their own independent address spaces and are typically used to run isolated tasks in parallel. We exploit the presence of independent address spaces in multiple processes to circumvent the lock in the HDF5 library.

[Figure 7.7: The proposed multi-process based solution: the parent-process performs data processing with multiple threads, while K child-processes, each with its own instance of the HDF5 library, submit I/O requests in parallel.]

Overview: Our proposed multi-process based solution is elaborated in Fig. 7.7. We use multiple threads in the single parent-process for data processing and multiple child-processes for I/O. The parent-process performs data processing using multiple threads in parallel. Each child-process has its own instance of the HDF5 library, as a consequence of the independent address spaces.
Moreover, each child-process has only a single thread that requests I/O. Thus, a single instance of the HDF5 library gets only one request at a time. In effect, there are multiple instances of the HDF5 library that can submit multiple I/O requests in parallel to the operating system (as opposed to the situation in Fig. 7.6), thus benefiting from the high throughput offered by the RAID configuration. Formally, if there are K processes and if the disk controller can serve K requests in parallel, the total time spent on I/O operations will be T = t × n / K (similar to the case in Fig. 7.3).

Details: The proposed multi-process based solution can be adopted for nanopore data processing using a pool of processes that performs FAST5 I/O. Multiple processes are spawned at the beginning of the program using the fork system call. These forked child-processes form a pool of processes that exist for the lifetime of the parent-process, solely performing I/O of FAST5 files. The data processing can be performed by multiple threads spawned by the parent-process as usual. When the parent-process requires the signal data of N reads to be loaded (FAST5 accesses), it first splits the list of reads into K parts, where K is the number of child-processes. Then, each part is assigned to a child-process, which performs the assigned FAST5 accesses. When the data is loaded, the child-processes send the data to the parent-process. The communication (data transfer) between the parent-process and the child-processes can be implemented relatively easily using unnamed pipes, out of the available Inter-Process Communication (IPC) techniques (though still not as easy as threads that share the same memory space).

Note-1: A fork-join model for multiple processes (as could be done with multi-threading) is unsuitable to be used instead of the process pool model presented above. Firstly, creating a process can be very expensive and could easily become a bigger bottleneck than the file reading itself. Secondly, forking in the middle of a program could double the memory usage
Secondly, forking in the middle of a program could double the memory usage and is usually problematic.

Figure 7.8: Flow diagram depicting modifications to Nanopolish.

Note-2: We propose the use of multiple threads in a single parent-process for data processing and multiple child-processes for I/O. The possibility of using separate processes for both data processing and I/O is discussed in Section 7.6.1 with its caveats.

In addition to the above I/O related optimisations, we also performed restructuring and a few other software optimisations with respect to multi-threading and memory. Our restructuring allows us to measure the execution time separately for I/O (including the execution time breakdown for different file formats) and data processing, without significant effect on performance, whereas the software optimisations improve the processing time. The original Nanopolish implementation uses OpenMP for multi-threading. We restructured Nanopolish to perform multi-threading using a lightweight fork-join model with work-stealing implemented using POSIX Threads (pthreads). Moreover, the restructured Nanopolish performs I/O operations and data processing batch by batch (batches of genomic reads), i.e., a batch of genomic reads is loaded from the disk and the batch is then processed; subsequently, the results of the batch are written to disk. I/O operations are interleaved with data processing, i.e.
when the first batch is being processed, the second batch will be loaded from the disk. The restructured Nanopolish was further optimised with strategies such as: reducing the number of memory allocations (malloc) for dynamic 2D arrays by allocating a single 1D array, an appropriate batch size that fits the available RAM, and better load-balancing between threads. While space limits our ability to explain each of these optimisations, the details can be found in the open-sourced code of this research project. For the sake of clarity, an overview of our restructuring and optimisations to the original Nanopolish, and the various resulting versions with their usage, is shown in Fig. 7.8.

Table 7.1: Example of SLOW5 file format.

Implementation of the Alternate File Format: We implemented a C program to convert FAST5 (HDF5) files into our SLOW5 format. The program also constructs a SLOW5 index as per the description in Section 7.4.1. The restructured and optimised Nanopolish (discussed in Section 7.4.3) was modified to support reading from the SLOW5 format (Fig. 7.8). At the beginning of the program, the SLOW5 index is loaded onto a hash table that resides in RAM. For each genomic read in a batch, the start position of the corresponding SLOW5 record (file offset) and the size of the record (in bytes) are obtained from the index. Then, that information for all the genomic reads in the batch is submitted as I/O requests. The POSIX AIO library in glibc is used for performing asynchronous I/O.

Implementation of the Multi-process Pool: The restructured and optimised Nanopolish was modified such that FAST5 files are loaded using a multi-process pool as per the description in Section 7.4.2 (Fig. 7.8). At the beginning of the program, K child-processes are spawned using the fork system call. Then, during the execution of the program, the parent-process divides the batch of genomic reads into K parts and assigns each part to a child-process.
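The pool mechanism just described, K children forked at startup, each serving requests over unnamed pipes, can be sketched as follows. This is a minimal illustration, not the thesis implementation: `K`, `worker_t` and the integer request format are invented for the sketch, and a trivial computation stands in for reading a FAST5 file through the HDF5 library.

```c
/* Sketch: a pool of K child-processes forked at startup, each serving I/O
 * requests over unnamed pipes. Every child has its own address space (and,
 * in the real tool, its own HDF5 library instance), so a process-wide lock
 * inside a library only serialises requests within one child. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define K 4

typedef struct { int to_child[2]; int to_parent[2]; pid_t pid; } worker_t;

/* Child: block on the request pipe; in the real tool this is where a
 * FAST5 file would be opened and read through the HDF5 library. */
static void child_loop(int rd, int wr) {
    int id;
    while (read(rd, &id, sizeof id) == (ssize_t)sizeof id) {
        int result = id * 10;  /* stand-in for loaded signal data */
        if (write(wr, &result, sizeof result) != (ssize_t)sizeof result)
            _exit(1);
    }
    _exit(0);  /* request pipe closed by parent: shut down */
}

static void spawn_pool(worker_t *w) {
    for (int i = 0; i < K; i++) {
        if (pipe(w[i].to_child) || pipe(w[i].to_parent)) exit(1);
        if ((w[i].pid = fork()) == 0) {  /* child keeps only its own ends */
            close(w[i].to_child[1]); close(w[i].to_parent[0]);
            child_loop(w[i].to_child[0], w[i].to_parent[1]);
        }
        close(w[i].to_child[0]); close(w[i].to_parent[1]);
    }
}
```

The parent would split each batch of N reads into K parts, write one part down each `to_child` pipe, and gather results from the `to_parent` pipes; closing the request pipes at shutdown lets every child see end-of-file and exit.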
Child-processes perform FAST5 file reading (through the HDF5 library) in parallel. After the child-processes complete reading, the data is collected by the parent-process. The inter-process communication is implemented using unnamed pipes in Linux.

Table 7.2: Example of SLOW5 index.

Datasets and Computer Systems: A representative nanopore dataset of the human genome was used for the evaluation; the details are in Table 7.3. This dataset is a complete nanopore MinION dataset of the T778 cancer cell-line [173, 238].

Table 7.3: Details of the dataset used for evaluation.
ID: D1; Sample: T778; No. of Gbases: 8.787; No. of reads: 771,325; Average read length: 11,393; Max read length: 194,983; FAST5 file size: 845 GB.

The computer systems used for the experiments and their specifications are given in Table 7.4. Unless otherwise stated, the experiments in this chapter have been performed on system S1. System S2 was used for a limited number of experiments due to its limited availability. For the experiments associated with the Network File System (NFS), the NFS storage on system S3 was mounted on system S1. For NFS, the default parameters for the NFS server and client in Linux were used. Note that the operating system disk cache on S3 was also cleared before any NFS experiment.

Measurements and Calculations: The measurements and calculations for our results are performed as follows.

1) The overall execution time (wall-clock time) and the CPU time (user mode + kernel mode) of the program (all versions shown in Fig. 7.8) were measured by running the program through the GNU time utility in Linux.

2) The CPU utilisation percentage is computed as in Equation 7.1. Note that this CPU utilisation percentage is a normalised value based on the number of data processing threads with which the program was executed.
Table 7.4: Computer systems.
System ID: S1 (HPC with HDD RAID); S2 (HPC with SSD RAID); S3 (NFS server).
CPU: S1: × Intel Xeon Gold 6154; S2: 2 × Intel Xeon Gold 6148; S3: 4 × Intel Xeon X7560.
CPU cores: S1: 36; S2: 40; S3: 32.
RAM: S1: 384 GB; S2: 768 GB; S3: 256 GB.
File system: ext4 on all three systems.
RAID config.: S1: RAID6; S2: RAID0; S3: RAID5.
OS: S1: Ubuntu 18.04.3 LTS; S2: CentOS 7.6.1810; S3: Ubuntu 14.04.6 LTS.

CPU utilisation = CPU time / (execution time × number of threads) × 100    (7.1)

3) The execution time of individual components (I/O operations and data processing) in the restructured and/or optimised Nanopolish (the three versions at the bottom of Fig. 7.8) was measured by inserting gettimeofday function calls at appropriate locations in the software source code. To prevent the operating system disk cache from affecting the accuracy of I/O results, we cleared the disk cache (pagecache, dentries and inodes) each time before a program execution, despite the effect of the hardware disk controller cache (∼ ).

4) Core-hours is calculated as the product of the number of processing threads employed and the number of hours (wall-clock time) spent on the job. This metric is inspired by the metric man-hours used in the labour industry and is used in the cloud computing domain to calculate data processing costs [214]. In an ideally parallel program, this metric remains constant with the number of cores/threads.

Overall Execution Time and CPU Utilisation: The overall execution time when our proposed SLOW5 file format is used in the restructured and optimised Nanopolish is shown in Fig. 7.9a, while the CPU utilisation and the core-hours are depicted in Fig. 7.9b. The x-axis represents the number of data processing threads with which the program was executed. The number of I/O threads for glibc POSIX AIO was also set to the same number as the number of data processing threads.
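The metrics above can be expressed as small helper functions. This is a sketch only: the function names are illustrative, and the inputs would come from GNU time (wall-clock and user+system CPU time) and from the gettimeofday calls mentioned in item 3.

```c
/* Sketch: the evaluation metrics of this section.
 * cpu_utilisation_pct() follows Equation 7.1 (normalised by thread count);
 * core_hours() is wall-clock hours multiplied by the thread count;
 * elapsed_s() shows gettimeofday-based interval timing for components. */
#include <assert.h>
#include <sys/time.h>

/* Seconds elapsed from timestamp a to timestamp b (gettimeofday values). */
static double elapsed_s(struct timeval a, struct timeval b) {
    return (double)(b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

/* Equation 7.1: CPU time over (wall time x threads), as a percentage. */
static double cpu_utilisation_pct(double cpu_s, double wall_s, int nthreads) {
    return cpu_s / (wall_s * nthreads) * 100.0;
}

/* Core-hours: constant across thread counts for an ideally parallel program. */
static double core_hours(double wall_h, int nthreads) {
    return wall_h * nthreads;
}
```

For example, a run that used 32 s of CPU time over 10 s of wall-clock time on 4 threads would score 80% utilisation, i.e. the threads were busy 80% of the time on average.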
Observation-1: Performance has improved w.r.t. the original Nanopolish for a given number of threads. To observe this, we compare Fig. 7.1a with Fig. 7.9a. At 4 threads, the execution time improved by ∼ × compared to the original Nanopolish. At 8, 16 and 24 threads, speedups of ∼ ×, ∼ × and ∼ × can be observed, respectively. At 32 threads, a ∼ . × speedup is observed. In other words, the speedup of our optimised version over the original Nanopolish increases with the number of threads.

Observation-2: CPU utilisation has improved with the number of threads when compared with the original Nanopolish. Comparing Fig. 7.1b with Fig. 7.9b reveals that CPU utilisation at 4 threads improved to 99%, which was 69% for the original Nanopolish. The CPU utilisation increases to 99% from 56%, 97% from 39%, and 92% from 28%, at 8, 16 and 24 threads, respectively, w.r.t. the original Nanopolish. At 32 threads, an improvement of CPU utilisation to 85% was observed, which was as low as 22% for the original Nanopolish.

Observation-3: Performance scaling with the number of threads is improved. This is evident from the core-hours plot in Fig. 7.9b, whose values are much smaller and almost constant when compared with its counterpart in Fig. 7.1b. For the original Nanopolish, the execution time with 32 threads improved only by ∼ × compared to running with 4 threads (from 9.7 h to ∼ h), whereas for our optimised version the improvement at 32 threads compared to 4 threads was ∼ ×.

Figure 7.9: Overall execution time and CPU utilisation when SLOW5 format is used: (a) execution time; (b) CPU utilisation and core-hours.

I/O Time Consumption: It was discussed previously that the identified bottleneck is due to I/O. Therefore, to get more insight into the effectiveness of our proposed solution, we plot and compare the time spent in I/O operations.
Specifically, the time spent reading nanopore raw signal data on system S1 when using the SLOW5 format is compared to when using the FAST5 format in Fig. 7.10a.

Figure 7.10: Comparison of FAST5 vs SLOW5 access: (a) on system S1 (HDD RAID); (b) on system S2 (SSD RAID).

Figure 7.11: Overall results for multi-process pool: (a) execution time; (b) CPU utilisation and core-hours.

We make the following observations from Fig. 7.10: 1) there is no improvement in FAST5 access time (brown bars) despite increasing the number of threads used (due to the lock in the HDF5 library); 2) in contrast, there is a significant improvement in the proposed SLOW5 access time (blue bars) with an increased number of threads; and, 3) even with a single thread, the proposed SLOW5 is ∼ × faster than FAST5, and at 32 threads the improvement of SLOW5 compared to FAST5 is ∼ ×. The speedup in I/O time for the single thread is contributed by the exploitation of locality (discussed in Section 7.4.1) and our lightweight SLOW5 access implementation.

The above experiments on system S1 demonstrated that our proposed solution effectively improves the performance of Nanopolish. The S1 system consists of HDD RAID. Now, we demonstrate that our solution is also effective on SSD RAID using experiments on system S2. As discussed above, the I/O decomposition results are more insightful; therefore, for the sake of brevity, we present only the I/O decomposition results for the S2 system (SSD RAID based). Fig. 7.10b shows the comparison of FAST5 access time to SLOW5 access time, where similar observations can be made.
In fact, FAST5 access time (brown bars) got worse with the number of threads, whereas SLOW5 access time (blue bars) improved with the number of threads. At 32 threads, SLOW5 was ∼ × faster than FAST5 on SSD RAID. Thus, our proposed solution is effective for HDD based RAIDs as well as SSD based RAIDs.

Note: The file size of the new SLOW5 format compares favourably to the existing FAST5 format. Specifically, the dataset which was 845 GB in FAST5 format (Table 7.3) reduced to 340 GB when converted to SLOW5. The reduced size is due to storing global metadata once in the SLOW5 header, instead of redundantly storing it for each read. The SLOW5 index is quite small (47 MB) compared to the gigabytes of RAM available on an HPC.

Overall Execution Time and CPU Utilisation: The overall execution time for the restructured and optimised Nanopolish when a multi-process pool is used for FAST5 access is shown in Fig. 7.11a, whereas the CPU utilisation and the core-hours are depicted in Fig. 7.11b. The x-axis of the figure corresponds to the number of data processing threads, which is also equal to the number of I/O processes. The results in Fig. 7.11 are similar to those of the SLOW5 solution discussed in the previous subsection. The key observations in Fig. 7.11 compared to the original Nanopolish are also similar to the first solution. These are: 1) improved performance w.r.t. the original Nanopolish for a given number of threads; 2) improved CPU utilisation; and, 3) better performance scaling with an increasing number of threads, as depicted by the near-flat core-hours plot.

Figure 7.12: FAST5 file access using multiple I/O threads vs I/O processes: (a) on system S1 (HDD RAID); (b) on system S2 (SSD RAID).
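As a point of reference for the comparisons between SLOW5 access and the FAST5 process pool, the SLOW5 access pattern (an in-RAM index lookup followed by an independent positional read per record, per Section 7.4.1) might look like the sketch below. The open-addressing hash table and all names are illustrative, and `pread()` stands in for the glibc POSIX AIO calls used in the actual implementation; the key point is that a positional read takes an explicit offset and needs no lock.

```c
/* Sketch: SLOW5-style random access. The index maps read ID -> (offset,
 * size); each fetch is an index lookup plus one lock-free pread(). */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NSLOT 1024  /* power of two, larger than the number of reads */

typedef struct { char id[64]; off_t off; size_t sz; int used; } slot_t;

static uint64_t fnv1a(const char *s) {  /* simple string hash */
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

static void idx_put(slot_t *t, const char *id, off_t off, size_t sz) {
    uint64_t i = fnv1a(id) & (NSLOT - 1);
    while (t[i].used && strcmp(t[i].id, id)) i = (i + 1) & (NSLOT - 1);
    snprintf(t[i].id, sizeof t[i].id, "%s", id);
    t[i].off = off; t[i].sz = sz; t[i].used = 1;
}

static const slot_t *idx_get(const slot_t *t, const char *id) {
    uint64_t i = fnv1a(id) & (NSLOT - 1);
    while (t[i].used) {
        if (!strcmp(t[i].id, id)) return &t[i];
        i = (i + 1) & (NSLOT - 1);
    }
    return NULL;  /* read ID not in the index */
}

/* Fetch one record: index lookup, then a positional read. pread() does not
 * move a shared file position, so many threads can call it concurrently. */
static char *fetch_record(int fd, const slot_t *t, const char *id) {
    const slot_t *s = idx_get(t, id);
    if (!s) return NULL;
    char *buf = malloc(s->sz + 1);
    if (pread(fd, buf, s->sz, s->off) != (ssize_t)s->sz) {
        free(buf);
        return NULL;
    }
    buf[s->sz] = '\0';
    return buf;
}
```

Because every call carries its own offset, fetches for different reads can be issued from any number of threads (or submitted asynchronously) without coordination, which is what makes the per-thread scaling of the blue bars possible.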
I/O Time Consumption: Similar to the previous section, we evaluate the time spent in I/O operations. We compare the results for the multi-threaded and the multi-process based versions. The plots are presented in Fig. 7.12, with the x-axis denoting the number of processes/threads used. On HDD RAID (Fig. 7.12a), the FAST5 access time does not improve with increased I/O threads (brown bars), while it significantly improves with increased I/O processes (green bars). At 32 threads/processes, the improvement was ∼ ×. On SSD RAID (Fig. 7.12b), the FAST5 access time gets worse with increased I/O threads. In contrast, it significantly improves with increased I/O processes. Using 32 I/O processes is ∼ × faster than using 32 I/O threads on SSD RAID.

In summary, using processes instead of threads for I/O operations alleviates the I/O bottleneck, while using multiple threads for data processing in a single parent-process avoids introducing any additional significant bottlenecks, as depicted by the above results. Comparing the time for SLOW5 access in Fig. 7.10 with the I/O process based pool for FAST5 (Fig. 7.12) shows that SLOW5 outperforms FAST5 even when multiple I/O processes are used, especially at lower numbers of threads/processes.

Figure 7.13: Single-FAST5 vs Multi-FAST5 using I/O threads: (a) on system S1 (HDD RAID); (b) on system S2 (SSD RAID).

ONT has recently been working on a new file format known as multi-FAST5, which is projected to replace the existing FAST5 format in the near future. The raw signals from multiple genomic reads (by default, 4000 genomic reads) are packed into a single FAST5 file, and such files are termed multi-FAST5.
Multi-FAST5 reduces the enormous number of small single-FAST5 files generated from a sequencing run, easing file management (e.g., copying/moving files, listing files). Multi-FAST5 files are also HDF5 files, whose schema is an extended version of that of single-FAST5. Next, we demonstrate that multi-FAST5 suffers from a similar bottleneck, and thus our proposed SLOW5 is superior to the new multi-FAST5 format. Moreover, our multi-process based solution is also applicable and effective for the multi-FAST5 format.

Performance Bottleneck in Multi-FAST5: First, we compare the file access time of multi-FAST5 to single-FAST5 in Fig. 7.13. Unfortunately, the access time does not improve with the use of multiple I/O threads on HDD RAID, similar to single-FAST5 (Fig. 7.13a). In fact, multi-FAST5 performance is actually worse than that of single-FAST5. On SSD RAID (Fig. 7.13b), the performance of multi-FAST5 and single-FAST5 is almost the same and gets gradually worse with the number of threads.

Figure 7.14: Single-FAST5 vs Multi-FAST5 using I/O processes: (a) on system S1 (HDD RAID); (b) on system S2 (SSD RAID).

Proposed Multi-process Solution on Multi-FAST5: Now we demonstrate that our multi-process based solution is also applicable to, and effectively improves the performance of, the new multi-FAST5 format. The access times for multi-FAST5 and single-FAST5 with our multi-process solution for different numbers of processes are shown in Fig. 7.14. With our solution, the trend of multi-FAST5 access time is similar to that of single-FAST5 files (on both HDD RAID and SSD RAID); that is, it gets significantly better with the number of I/O processes used. Note that, when compared with single-FAST5, multi-FAST5 takes more time, visibly so on the HDD RAID.
Our proposed optimisations have the potential to benefit direct execution of nanopore data analysis tools on data residing on network attached storage. HPC cluster environments predominantly use such network attached storage in addition to local RAID systems. We demonstrate the performance of our proposed methods on NFS in Fig. 7.15.

Figure 7.15: Performance on NFS: (a) SLOW5 performance on NFS; (b) FAST5 performance on NFS.

Fig. 7.15a compares our SLOW5 format with FAST5 on NFS over multiple I/O threads. The use of multiple I/O threads for accessing FAST5 files on NFS (brown bars) slightly improves the performance up to around 4 threads (unlike previously on local RAID), after which it saturates. SLOW5 access (blue bars) is much faster than FAST5. SLOW5 access time improves up to around 8 threads and then saturates. Fig. 7.15b compares our proposed process-pool based method to using multiple I/O threads. The use of multiple I/O processes (green bars) considerably improves the FAST5 access performance up to around 8 processes, after which it slowly saturates, a trend similar to that of SLOW5. Comparing SLOW5 (blue bars in Fig. 7.15a) to FAST5 access using multiple I/O processes (Fig. 7.15b) shows that SLOW5 performance is superior. Refer to Appendix H for supplementary results and analyses.

In this chapter, we presented two solutions to overcome the I/O bottleneck caused by the FAST5 file format and demonstrated their efficacy using experiments. Additionally, there are a few other possible solutions to the problem, as discussed below.

Fixing the HDF5 Library: As discussed above, using a new file format may not always be practical, and we presented a multi-process based solution for such a scenario. Another candidate solution is to fix (optimise) the HDF5 library to be thread efficient.
However, the HDF5 library is a complicated library with a large code base of > .

Naive Approaches to Multi-processing: Instead of using a process pool solely for FAST5 I/O and multiple threads for parallel data processing (as proposed in Section 7.4.2), programmers may use multiple processes for both the I/O operations and the parallel data processing. This would be easier than implementing a pool of processes; however, it is only suitable for trivially parallel cases. If the application needs to share data among multiple processing units, processes are unsuitable due to the complexity that arises when performing inter-process communication. Alternatively, the programmer may let users manually split the data and launch multiple processes. Unfortunately, this method exerts an additional burden on the user, i.e., custom scripts must be written for data splitting, launching data processing, and concatenating the results. Moreover, this is only suitable for trivially parallel applications where data can be easily split. Also, an expensive HPC system with dozens of cores is then superfluous, as the user could instead use a cluster of low-cost networked computers (as shown in Chapter 6).

In summary, using processes instead of threads potentially solves the I/O bottleneck, as we demonstrated in the results. However, it is important to note that processes in an operating system are meant for isolation, whereas threads are for sharing data. Inter-process communication requires system calls, while inter-thread communication involves sharing the same memory space. Further, processes are expensive to spawn and are not lightweight (unlike threads). Thus, using processes as a replacement for threads makes the code relatively complicated. Therefore, we suggest that using the SLOW5 format is a superior solution to the multi-process based solution.

As shown in Section 7.5.2, the SLOW5 file size is smaller than FAST5 due to the efficient storage of metadata.
The SLOW5 file size can potentially be further reduced by using a binary encoding instead of ASCII and/or by applying block compression techniques such as BGZF that still allow random access [239]. Having both ASCII and binary formats is useful, where the former is human readable and the latter is space efficient. In fact, gold-standard file formats in genomics such as SAM and VCF, which are in ASCII, have their binary counterparts BAM and BCF.

After applying the optimisations proposed in this chapter, the next bottleneck in Nanopolish could be the FASTA access (random access to the reference genome), which is performed using the faidx component of the htslib library. faidx is not currently thread-safe, and thus only single-threaded access is possible. However, FASTA is a simple ASCII based format, and thus extending faidx for thread efficiency is feasible as future work.

It is likely that the identified limitation in the HDF5 library is a primary bottleneck in several other nanopore software toolkits which also use the HDF5 library, such as Tombo [240], NanoMod [241] and SquiggleKit [242]. Thus, our proposed optimisations are potentially useful in such toolkits. Our work may also guide nanopore software developers to avoid the identified bottleneck in the future. Furthermore, HDF5 is also used in other engineering domains such as physics, astronomy and weather forecasting [243]. Therefore, we believe that our work will inspire optimisations in those domains.

In this chapter, we demonstrated with an example that nanopore software fails to take maximal advantage of the computing power offered by many-core processors in HPC systems, despite multi-threaded implementations. To address this problem, we presented a systematic experimental analysis to identify potential performance bottlenecks in nanopore software tools running on many-core CPUs.
We identified that the bottleneck is caused by inefficient file I/O associated with the HDF5 library used for loading nanopore raw data. The inefficiency of file I/O in HDF5 is due to a global lock which prevents multiple threads requesting file accesses in parallel. We then proposed multiple optimisations to alleviate the bottleneck. We proposed a new file format that facilitates efficient file access using multiple threads. For the scenarios where the original format must be used, we presented a multi-process based solution. Thus, our proposed optimisations can be used as an alternative to, or alongside, the existing file format. Our experiments demonstrated that our optimisations not only enable improved performance for a given number of threads (∼ × for 4 threads and ∼ . × for 32 threads), but also enable improved CPU utilisation (from 69% to 99% for 4 threads and from 22% to 85% for 32 threads) when compared to the original Nanopolish. Consequently, improved performance scaling with the number of threads was also achieved (∼ . × for 4 vs. 32 threads).

Chapter 8: Conclusion and Future Directions

DNA sequencing is a revolutionary technology that is reshaping the fields of medicine and healthcare. In addition, DNA sequencing has important applications in other fields such as epidemiology and forensics. Over the last two decades, the size of DNA sequencers has shrunk from that of a fridge to that of a mobile phone, and the sequencing cost per genome has reduced by more than 1000 times. These remarkable improvements are expected to continue.
Unfortunately, the hundreds to thousands of gigabytes of data output by today's ultra-portable sequencers are analysed on non-portable high-performance computers or cloud computers, just as was the case a couple of decades ago.

This thesis moved DNA sequence analysis from high-performance computers to portable computing devices, a timely need for enhancing the use of ultra-portable sequencers at the point-of-care or in-the-field. The objective was achieved using computer architecture-aware optimisation of complex DNA analysis workflows. Such optimisations enabled efficient mapping of the software to exploit complex features of modern computer hardware. Domain knowledge of both computer architecture and DNA sequence analysis was used simultaneously to achieve the twin goals of efficient compute resource utilisation and no impact on accuracy. Therefore, this thesis is an attempt to bridge the two domains, DNA sequencing and computer architecture.

In this thesis, gold-standard DNA sequence analysis software tools were systematically examined for bottlenecks, and architecture-aware optimisations were performed at the I/O level, processor level, RAM level, cache level and register level. The optimised software tools were used to perform complete end-to-end analysis workflows on prototype embedded systems composed of single-board computers. The performance and accuracy were evaluated using real and representative datasets. The resultant embedded systems were fully functional, with performance comparable to an unoptimised workflow on a high-performance computer. The constructed prototype embedded systems are currently being used for in-house data analysis at the Garvan Institute of Medical Research, Sydney.
Such a low-cost, energy-efficient, sufficiently fast and portable embedded system enables complete DNA analysis at the point-of-care or in-the-field.

In addition to prototype embedded systems composed of single-board computers, this thesis has also made it possible to run DNA analysis workflows on commodity portable computing devices such as laptops, tablets and mobile phones. The optimisations proposed in this thesis also benefit DNA analysis workflows on high-performance computers through many-times-faster performance. Optimised versions of the software produced under this thesis are released as open-source software, which is being used by several research centres globally; users have been surprised by the significant speedup achieved compared to existing software. The conclusion from each chapter of this thesis is given below.

A popular variant calling software for second-generation sequence data called Platypus was optimised for efficient usage of the memory hierarchy. Systematically examining the steps in variant calling revealed that 60% of the total variant calling time is consumed by de Bruijn graph construction during the local re-assembly step. After carefully inspecting the data access patterns, optimisations were proposed to improve the locality of memory accesses at both the cache level and the register level. The existing algorithm was modified to integrate the proposed optimisations, which in turn improved the usage of faster cache memories and registers. The results showed that these changes improve the performance of de Bruijn graph construction by a factor of around two when implemented on a general-purpose processor.
The modified algorithm opens the door to much higher acceleration of local re-assembly on GPUs, FPGAs and ASIPs. The implementation of the algorithm, which is integrated into the Platypus variant caller, is publicly available at https://github.com/hasindu2008/platycflr.

The gold-standard software for aligning long reads generated from third-generation high-throughput sequencers, called Minimap2, was optimised for reduced memory capacity. Minimap2 relies on a large hash table data structure (constructed from the reference genome) stored in RAM for the alignment process. Large reference genomes such as the human genome require 11 GB for the hash table alone. Mere parameter tuning in Minimap2 cannot substantially reduce memory usage without considerably sacrificing alignment quality. Memory capacity optimisations were proposed to substantially reduce memory usage: partitioning the alignment index, saving the internal state, and merging the output a posteriori. This strategy reduced the memory requirement for aligning long reads to the human reference genome from 11 GB to less than 2 GB, with minimal impact on accuracy. This work made it possible to perform read alignment to large reference genomes using computers with limited volatile memory. The optimised version of Minimap2 is available as open-source at https://github.com/hasindu2008/minimap2-arm and is also integrated into the original Minimap2 software.

A popular signal analysis toolkit for analysing nanopore raw signal data called Nanopolish was optimised for CPU-GPU heterogeneous systems. Examining the methylation calling tool in the Nanopolish toolkit revealed that around 70% of the runtime is consumed by an algorithm called Adaptive Banded Event Alignment (ABEA). Despite this algorithm not being embarrassingly parallel, an approach was proposed that made it execute efficiently on GPUs.
The high variability of read lengths was one of the main challenges, which was remedied through a number of memory optimisations and a heterogeneous processing strategy that uses both the CPU and the GPU. The proposed optimisations yielded around a 3-5× performance improvement on a CPU-GPU system when compared to a CPU alone. The CPU-GPU optimised ABEA was integrated back into a completely re-engineered version of the Nanopolish methylation calling tool, and this resultant new software was named f5c. It was demonstrated that f5c is capable of processing data from a portable nanopore sequencer in real-time using an embedded SoC equipped with an ARM processor (with six cores) and an NVIDIA GPU (256 cores). f5c benefits not only embedded SoCs but also a wide range of GPU-equipped systems, from laptops to servers. f5c was not only around 9× faster on an HPC but also reduced the peak RAM usage by around 6×. The source code of f5c is available at https://github.com/hasindu2008/f5c.

A system architecture was proposed for performing a popular DNA methylation detection workflow on a prototype embedded system. The workflow was realised on the proposed architecture by integrating the optimised software versions from the previous chapters. The proposed architecture was evaluated using off-the-shelf single-board computers, and it was demonstrated that real-time analysis of nanopore sequencing is possible on an embedded system. It was further demonstrated that the performance of the prototype embedded system is surprisingly similar to the performance on an HPC.
The system architecture and the associated software for building a replica of the prototype are released as open-source code, available at https://github.com/hasindu2008/nanopore-cluster and https://github.com/hasindu2008/f5p.

The cause of the unexpectedly slow performance on an HPC was identified to be the nanopore software failing to take maximal advantage of the computing power offered by many-core processors in HPC systems, despite its multi-threaded implementation. A systematic experimental analysis was conducted to identify potential performance bottlenecks in nanopore software tools running on many-core CPUs. This analysis revealed that the bottleneck is caused by inefficient file I/O associated with the HDF5 library used for loading nanopore raw data. The inefficiency of file I/O in HDF5 was identified to be due to a global lock which prevents multiple threads requesting file accesses in parallel. Multiple optimisations were proposed to alleviate the bottleneck: a new file format that facilitates efficient file access using multiple threads; and a multi-process based solution for the scenarios where the original format must be used. Thus, the proposed optimisations can be used as an alternative to, or alongside, the existing file format. The experiments demonstrated that the optimisations not only enable improved performance for a given number of threads (∼ × for 4 threads and ∼ . × for 32 threads) but also enable improved CPU utilisation (from 69% to 99% for 4 threads and from 22% to 85% for 32 threads) when compared to the original Nanopolish. Consequently, improved performance scaling with the number of threads was also achieved (∼ . × for 4 vs. 32 threads).

Conclusively, the architecture-aware optimisations presented in this thesis are significant contributions that result in an ultra-portable DNA analysis system, and they additionally benefit the performance of DNA analysis workflows on an HPC.
In the upcoming decades, DNA sequencers will further miniaturise and sequencing costs will become increasingly affordable. Consequently, DNA tests have the potential to become as routine and decentralised as today's blood tests. The realisation of this prospect requires sequence analysis devices to be further miniaturised as well. This thesis has laid the foundation by demonstrating functional, complex DNA sequence analysis workflows on prototypical embedded systems constructed out of off-the-shelf embedded computing devices. This goal was achieved through architecture-aware optimisation of the analysis software, with the future goal of building domain-specific architectures for DNA sequence analysis in mind.

The architecture proposed in this thesis was evaluated using a prototype constructed out of multiple single-board computers interconnected using Ethernet. The prototype is bulky, mainly due to the many cables. However, designing a custom carrier board that can accommodate multiple off-the-shelf system-on-modules with integrated Ethernet and power delivery would produce a system many times smaller than the prototype. In addition, the proposed system can be miniaturised onto a single chip by designing a multiprocessor system on a chip (MPSoC) composed of application-specific instruction-set processors (ASIPs). Such an MPSoC would be an order of magnitude smaller, with superior performance and lower energy consumption when compared to the current prototype.
Such an MPSoC integrated into an ultra-portable sequencer will enable complete DNA analysis in the palm of the hand.

Orthogonal to the above directions, the currently developed embedded system can be extended to an end-to-end application of direct biological significance, such as a diagnostic test. However, such a direction would require strong collaboration with biologists and clinicians. Furthermore, there are other branches of genomics that can be explored, for instance, ribonucleic acid (RNA) workflows, metagenomic analyses and de novo assembly. As exemplified in this thesis for two DNA analysis workflows (genetic variant detection using second-generation sequencing and epigenetic modification detection using third-generation sequencing), the other workflows will also have significant room for improvement through architecture-aware optimisation alone. Increased collaboration between researchers from the two domains, computer architecture and DNA sequencing, will be favourable for efficiently reducing the gap between DNA sequencing and its analysis.

Appendix A

Supplementary Materials - Featherweight Long Read Alignment using Partitioned Reference Indexes

This appendix is published as supplementary material of [25] in Nature Scientific Reports under a Creative Commons CC BY license.

A.1 Supplementary Note 1 - Detailed Methodology of the Merging

This supplementary note elaborates the merging method in detail, together with some implementation details.

A.1.1 Serialising (dumping) of the internal state

For each part of the partitioned index, a separate intermediate file (which we refer to as a dump) is created in binary format [refer to lines 36-44 in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c].
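As a rough illustration of how such a binary dump can flatten pointer-based records, here is a minimal sketch. The struct and function names are hypothetical simplifications of Minimap2's mm_reg1_t and mm_extra_t, not the actual definitions: fixed-size fields are written first, then the variable-length array, so no memory pointer ends up in the file.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-ins for mm_extra_t / mm_reg1_t. */
typedef struct {
    int32_t score;    /* base-level alignment score */
    uint32_t n_cigar; /* number of CIGAR operations */
    uint32_t *cigar;  /* variable-length CIGAR array (heap pointer) */
} extra_t;

typedef struct {
    uint32_t rid, qs, qe, rs, re; /* reference ID, query/ref start and end */
    extra_t *p;                   /* base-level info, NULL if not requested */
} reg_t;

/* Flatten one record to the dump: fixed fields, a presence flag, then the
   CIGAR length and array, eliminating the memory pointer. */
static void dump_reg(FILE *fp, const reg_t *r) {
    fwrite(&r->rid, sizeof(uint32_t), 5, fp); /* rid..re are contiguous */
    uint32_t has_extra = (r->p != NULL);
    fwrite(&has_extra, sizeof(uint32_t), 1, fp);
    if (has_extra) {
        fwrite(&r->p->score, sizeof(int32_t), 1, fp);
        fwrite(&r->p->n_cigar, sizeof(uint32_t), 1, fp);
        fwrite(r->p->cigar, sizeof(uint32_t), r->p->n_cigar, fp);
    }
}

/* Restore a flattened record, re-allocating the CIGAR array on the heap. */
static void load_reg(FILE *fp, reg_t *r) {
    fread(&r->rid, sizeof(uint32_t), 5, fp);
    uint32_t has_extra = 0;
    fread(&has_extra, sizeof(uint32_t), 1, fp);
    r->p = NULL;
    if (has_extra) {
        r->p = malloc(sizeof(extra_t));
        fread(&r->p->score, sizeof(int32_t), 1, fp);
        fread(&r->p->n_cigar, sizeof(uint32_t), 1, fp);
        r->p->cigar = malloc(r->p->n_cigar * sizeof(uint32_t));
        fread(r->p->cigar, sizeof(uint32_t), r->p->n_cigar, fp);
    }
}
```

Because the dump contains only values (never addresses), records written during one run can be restored into freshly allocated structures during the later merging run.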
After a read is aligned to the partition of the index currently in memory, all the intermediate states for its alignments are dumped into this binary file [lines 501-506 in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/map.c]. The binary format was preferred as it reduces the file size compared to ASCII. When the last read is mapped to the current partition of the index in memory, the dump will contain the intermediate state of the mappings for all the reads, in the same order as the reads in the input read set. If the partitioned index has n partitions, at the end of the n-th partition we will have n such dumps.

The dumped internal state includes fifteen 32-bit unsigned integers (such as the reference ID, chaining scores, and query and reference start and end positions), two 32-bit signed integers and one floating-point value. All this information is inside a single structure in Minimap2 (called mm_reg1_t in minimap.h), which made the dumping convenient. The size required for a single alignment is around 80 bytes.

If the user has requested Minimap2 to generate the base-level alignment, then the internal state for the base-level alignment is also dumped. The base-level alignment information includes six 32-bit integers (such as the base-level alignment score and the number of CIGAR operations) and a variable-size flexible integer array for storing CIGAR operations. This information is stored inside another structure in Minimap2 (called mm_extra_t), which is only allocated if the base-level alignment has been requested. The memory address of this structure is stored as a pointer in the previously mentioned mm_reg1_t structure. When dumping, we flatten the information linearly (eliminating memory pointers) to the file.

In addition to the above, a quantity called replen (the sum of the lengths of regions in the read that are covered by highly repetitive k-mers) is dumped. This is a per-read quantity.
We save the replen to the same dump file discussed above, just after the information for each mapping. For each read there will be one replen per part of the index, saved in the dump for that particular part of the partitioned index [line 495 of https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/map.c].

A.1.2 Merging operation

When the alignment of all reads to all parts of the index completes, the merging operation is invoked [merge function in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c]. We simultaneously open the read file and the dump files for all parts of the partitioned index. Reads are sequentially loaded while loading all the internal states for the alignments of each read. This includes the internal state for all its alignments (including the base-level information if it had been requested) as well as the replen from each dump file. The flattened data in the files are restored to their original structures when loaded into memory.

If no base-level alignments had been requested, the alignments are sorted by chaining score in descending order [function mm_hit_sort_by_score in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c]. If base-level alignments had been requested, they are sorted by the base-level DP alignment score. Categorisation into primary and secondary chains is performed on the sorted alignments, following the same method used in Minimap2 (using the mm_set_parent function). This fixes the issue with the primary vs. secondary flag. Then the alignment entries are filtered based on the user-requested number of secondary alignments and the priority ratio (using the mm_select_sub function). This eliminates the issue of outputting secondary alignments for each part of the index, which makes the output size huge.
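The descending-score sort above can be sketched with the C standard library's qsort; the record type below is a hypothetical two-field simplification (the real implementation is mm_hit_sort_by_score in merge.c, operating on full mm_reg1_t records).

```c
#include <stdlib.h>

/* Hypothetical, trimmed-down alignment record. */
typedef struct {
    int score; /* chaining score (or base-level DP score if requested) */
    int rid;   /* reference sequence ID */
} aln_t;

/* Comparator for descending score order: best alignment first. */
static int cmp_score_desc(const void *a, const void *b) {
    const aln_t *x = (const aln_t *)a, *y = (const aln_t *)b;
    return (y->score > x->score) - (y->score < x->score);
}

/* Sort all merged alignments of one read so the best-scoring one comes
   first, ready for primary/secondary categorisation and filtering. */
static void sort_alignments(aln_t *regs, size_t n) {
    qsort(regs, n, sizeof(aln_t), cmp_score_desc);
}
```

After this sort, marking the first entry as the primary chain and filtering the tail by count and priority ratio follows naturally, as described in the text.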
If the output has been requested in the form of a SAM file, the best primary alignment is flagged as primary while all other primary alignments are flagged as supplementary (using the mm_set_sam_pri function).

The mapping quality (MAPQ) estimation depends on the length of the read covered by repeat regions in the genome. To compute a perfect value for this quantity, the whole index needs to be in memory, which is the case for a single reference index. Instead, we estimate this quantity by taking the maximum of the replen values that were dumped for the particular read. The Spearman correlation of this estimated value with the perfect replen was 0.9961. As the mapping quality is an estimate anyway, computing it based on the estimated replen does not affect the final results significantly.

A.1.3 Emulated single reference index

For memory efficiency, Minimap2 stores the meta-data of reference sequences (such as the sequence name and sequence length) only in the reference index (refer to the mm_idx_t struct in minimap.h). The order in which the sequences reside in the struct array forms a unique numeric identifier for each reference sequence. In the internal state for mappings, only this numeric identifier is stored. The meta-data for the reference sequences are resolved using these numeric identifiers only during output printing. However, during merging we do not have the reference indexes in memory, so the numeric identifiers cannot be resolved. Hence, we construct an emulated single reference index. For this, we save the meta-data of the reference sequences when each part of the partitioned index is loaded [lines 47-54 in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c]. These meta-data go at the beginning of the dump file for the particular part of the index.
At the beginning of the merging, the meta-data are loaded back to form an emulated single reference index [lines 164-173 in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c]. However, the numeric identifiers in the internal states from the dump files are incorrect (as the numeric identifier is an independently incrementing index for each part of the index). These are corrected to be compatible with the numeric identifiers in the emulated single reference index by adding the correct offset [line 254 in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c].

As a side effect of this emulated single reference index, a correct SAM header can be output even in partitioned mode. Further, the merging process, which merges the mappings for one read at a time, outputs the mappings for a particular read ID adjacently. Hence, no additional sorting is required for downstream analysis tools that expect such ordering.

A.2 Supplementary Note 2 - Detailed Methodology of the Chromosome Balancing

A.2.1 Memory efficiency for references with unbalanced lengths

The existing partitioned index construction method in Minimap2 does not balance the sizes of index partitions when the reference genome has sequences (chromosomes) with highly varying lengths. This existing index construction method puts the reference sequences into the index in the order they appear in the reference genome. When constructing a partitioned index, it moves to the next part of the index only when the user-specified number of bases per index partition (by default 4 Gbases) is exceeded. When building a partitioned index for overlap finding, the parts would be approximately equal in size, as the length of the longest read would be a few megabases. However, in the case of a reference genome like the human genome, where the chromosomes are of highly variable lengths, the sizes of the parts are unbalanced. The largest part of the index determines the peak memory.
Hence, such an imbalance hinders maximum efficiency on systems with limited memory. For instance, consider a hypothetical genome (total length 700M) with the following chromosomes and lengths, in the order chr1 (300M), chr2 (320M), chr3 (60M), chr4 (20M). Providing a value of 350M as the number of bases per partition (with the intention of splitting into 2 parts) will create an unbalanced index as follows:

• part1 : chr1, chr2 : total length - 620M
• part2 : chr3, chr4 : total length - 80M

We follow a simple partitioning approach to balance this out. Instead of the number of bases per partition, the number of partitions is taken as user input. The reference sequences are first sorted in descending order of sequence length (length without the ambiguous N bases). The sum of bases in each partition is initialised to 0. Then, the sorted list is traversed in order while assigning the current sequence to the partition with the minimum sum of bases. The sum of bases in that partition is updated accordingly. Using this strategy, we get a distribution as follows:

• part1 : 300M, 60M : total length - 360M
• part2 : 320M, 20M : total length - 340M

A.3 Supplementary Note 3 - Instructions to run the tools

A.3.1 Example

1. Download and compile minimap2 that supports partitioned indexes and merging:

wget https://github.com/hasindu2008/minimap2-arm/archive/v0.1.tar.gz
tar xvf v0.1.tar.gz && cd minimap2-arm-0.1 && make

2. Download the human reference genome and create a partitioned index with 4 partitions:

wget -O hg38noAlt.fa.gz http://bit.ly/hg38noAlt && gunzip hg38noAlt.fa.gz
./misc/idxtools/divide_and_index.sh hg38noAlt.fa 4 hg38noAlt.idx ./minimap2 map-ont

Note: http://bit.ly/hg38noAlt redirects to ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

3.
Download a Nanopore NA12878 dataset and run Minimap2 with merging:

wget -O na12878.fq.gz http://bit.ly/NA12878
./minimap2 -a -x map-ont hg38noAlt.idx na12878.fq.gz --multi-prefix tmp > out.sam

Note: http://bit.ly/NA12878 redirects to http://s3.amazonaws.com/nanopore-human-wgs/rel3-nanopore-wgs-84868110-FAF01132.fastq.gz

Notes:

• To perform mapping without base-level alignment use:

./minimap2 -x map-ont hg38noAlt.idx na12878.fq.gz --multi-prefix tmp > out.paf

• From Minimap2 version 2.12-r827 onwards [https://github.com/lh3/minimap2/blob/master/NEWS.md], use the --split-prefix option instead of --multi-prefix.

A.3.2 Index construction with chromosome size balancing

divide_and_index.sh is the wrapper script for balanced index construction. It takes the reference genome and outputs a partitioned index optimised for reduced peak memory. Its usage is as follows:

usage: ./divide_and_index.sh <reference.fa> <num_parts> <out.idx>

The functionality of divide_and_index.sh is as follows:

1. Compiling divide.c using gcc to produce divide.
2. Calling the compiled binary divide to split the reference genome into partitions such that the total lengths of chromosomes in each partition are approximately equal.
3. Calling the minimap2 binary separately on each reference partition to produce a separate index file for each partition.
4. Combining all the index files to produce a single partitioned index file.

A.3.3 Running Minimap2 on a partitioned index with merging

To run minimap2 on an index created using the above method:

minimap2 -x <profile> <partitioned_index.idx> <reads.fastq> --multi-prefix <prefix>

Appendix B

f5c Documentation

This appendix is based on the f5c documentation available at https://hasindu2008.github.io/f5c, associated with the GitHub repository at https://github.com/hasindu2008/f5c.
B.1 Readme

f5c is an optimised re-implementation of the call-methylation and eventalign modules in Nanopolish. Given a set of basecalled Nanopore reads and the raw signals, f5c call-methylation detects methylated cytosines and f5c eventalign aligns raw nanopore DNA signals (events) to the base-called read. f5c can optionally utilise NVIDIA graphics cards for acceleration.

First, the reads have to be indexed using f5c index. Then, invoke f5c call-methylation to detect methylated cytosine bases. Finally, you may use f5c meth-freq to obtain methylation frequencies. Alternatively, invoke f5c eventalign to perform event alignment. The results are almost the same as from Nanopolish, except for a few differences due to floating-point approximations.

Full documentation: https://hasindu2008.github.io/f5c/docs/overview
Pre-print: https://doi.org/10.1101/756122

B.1.1 Quick start

If you are a Linux user and want to try f5c out quickly, download the compiled binaries from the latest release. For example:

VERSION=v0.4
wget "https://github.com/hasindu2008/f5c/releases/download/$VERSION/f5c-$VERSION-binaries.tar.gz" && tar xvf f5c-$VERSION-binaries.tar.gz && cd f5c-$VERSION/
./f5c_x86_64_linux
./f5c_x86_64_linux_cuda

The binaries should work on most Linux distributions, and the only dependency is zlib, which is available by default on most distros.

B.1.2 Building

Users are recommended to build from the latest release tarball. You need a compiler that supports C++11. Quick example for Ubuntu:

sudo apt-get install libhdf5-dev zlib1g-dev
VERSION=v0.4
wget "https://github.com/hasindu2008/f5c/releases/download/$VERSION/f5c-$VERSION-release.tar.gz" && tar xvf f5c-$VERSION-release.tar.gz && cd f5c-$VERSION/
scripts/install-hts.sh
./configure
make

The commands to install the hdf5 (and zlib) development libraries on some popular distributions:

On Debian/Ubuntu: sudo apt-get install libhdf5-dev zlib1g-dev
On Fedora/CentOS: sudo dnf/yum install hdf5-devel zlib-devel
On Arch Linux: sudo pacman -S hdf5
On OS X: brew install hdf5

If you skip scripts/install-hts.sh and ./configure, hdf5 will be compiled locally. This is a good option if you cannot install the hdf5 library system-wide. However, building hdf5 takes ages.

Building from the GitHub repository additionally requires autoreconf, which can be installed on Ubuntu using sudo apt-get install autoconf automake.

Other building options are detailed in section B.2. Instructions to build a Docker image are detailed in section B.2.4.

B.1.2.1 NVIDIA CUDA support

To build for the GPU, you need to have the CUDA toolkit installed. Make sure nvcc (the NVIDIA CUDA compiler) is in your PATH.

The building instructions are the same as above, except that you should call make as:

make cuda=1

Optionally you can provide the CUDA architecture as:

make cuda=1 CUDA_ARCH=-arch=sm_xy

If your CUDA library is not in the default location /usr/local/cuda/lib64, point to the correct location as:

make cuda=1 CUDA_LIB=/path/to/cuda/library/

Refer to section B.2.5 for troubleshooting CUDA-related problems.

B.1.3 Usage

f5c index -d [fast5_folder] [read.fastq|fasta]
f5c call-methylation -b [reads.sorted.bam] -g [ref.fa] -r [reads.fastq|fasta] > [meth.tsv]
f5c meth-freq -i [meth.tsv] > [freq.tsv]
f5c eventalign -b [reads.sorted.bam] -g [ref.fa] -r [reads.fastq|fasta] > [events.tsv]

Refer to section B.2.8 for all the commands and options.

B.1.3.1 Example

Follow the same steps as in the Nanopolish tutorial while replacing nanopolish with f5c. If you only want to perform a quick test of f5c:

wget -O f5c_na12878_test.tgz "https://f5c.page.link/f5c_na12878_test"
tar xf f5c_na12878_test.tgz
f5c index -d chr22_meth_example/fast5_files chr22_meth_example/reads.fastq
f5c call-methylation -b chr22_meth_example/reads.sorted.bam -g chr22_meth_example/humangenome.fa -r chr22_meth_example/reads.fastq > chr22_meth_example/result.tsv
f5c meth-freq -i chr22_meth_example/result.tsv > chr22_meth_example/freq.tsv
f5c eventalign -b chr22_meth_example/reads.sorted.bam -g chr22_meth_example/humangenome.fa -r chr22_meth_example/reads.fastq > chr22_meth_example/events.tsv

B.1.4 Acknowledgement

This repository reuses code and methods from Nanopolish. The event detection code is from Oxford Nanopore's Scrappie basecaller. Some code snippets have been taken from Minimap2 and Samtools.

B.2 Building f5c

Note: Building from the GitHub repository requires autoreconf, which can be installed on Ubuntu using sudo apt-get install autoconf automake.

Clone the git repository:

git clone https://github.com/hasindu2008/f5c && cd f5c

Alternatively, download the latest release tarball and extract, e.g.:

VERSION=v0.4
wget "https://github.com/hasindu2008/f5c/releases/download/$VERSION/f5c-$VERSION-release.tar.gz" && tar xvf f5c-$VERSION-release.tar.gz && cd f5c-$VERSION/

While we have tried hard to avoid dependency hell, three dependencies (zlib, HDF5 and HTS) could not be avoided.

Currently, three building methods are supported:

1. Locally compiled HTS library and system-wide HDF5 library (recommended).
2. Locally compiled HTS and HDF5 libraries (HDF5 local compilation takes a bit of time).
3. System-wide HTS and HDF5 libraries (not recommended, as HTS versions can be old).

B.2.1 Method 1 (recommended)

Dependencies: Install the HDF5 (and zlib) development libraries.

On Debian/Ubuntu: sudo apt-get install libhdf5-dev zlib1g-dev
On Fedora/CentOS: sudo dnf/yum install hdf5-devel zlib-devel
On Arch Linux: sudo pacman -S hdf5
On OS X: brew install hdf5

Now build f5c.
autoreconf
scripts/install-hts.sh
./configure
make

B.2.2 Method 2 (time consuming)

Dependencies: Install the zlib development libraries.

On Debian/Ubuntu: sudo apt-get install zlib1g-dev
On Fedora/CentOS: sudo dnf/yum install zlib-devel

Now build f5c.

autoreconf
scripts/install-hts.sh
scripts/install-hdf5.sh
./configure --enable-localhdf5
make

B.2.3 Method 3 (not recommended)

Dependencies: Install HDF5 and HTS.

On Debian/Ubuntu: sudo apt-get install libhdf5-dev zlib1g-dev libhts1

Now build f5c.

autoreconf
./configure --enable-systemhts
make

B.2.4 Docker image

To build a Docker image:

git clone https://github.com/hasindu2008/f5c && cd f5c
docker build .

Note down the image uuid and run f5c as:

docker run -v /path/to/local/data/:/data/ -it <image_id> ./f5c call-methylation -r /data/reads.fa -b /data/alignments.sorted.bam -g /data/ref.fa

B.2.5 CUDA Troubleshooting

B.2.6 Compiling issues

B.2.6.1 make: nvcc: Command not found error when compiling with make cuda=1

Make sure that the NVIDIA CUDA toolkit is installed. See the instructions in the official installation guide. If you still get this error after the toolkit installation, then nvcc is probably not in your PATH. In that case, either add the nvcc location to your PATH or manually specify the nvcc location through a Makefile variable.

Example: If you installed the CUDA toolkit through apt on Ubuntu,

make cuda=1 NVCC=/usr/local/cuda/bin/nvcc

If you did the Runfile installation on Ubuntu,

make cuda=1 NVCC=/usr/local/cuda-<toolkit-version>/bin/nvcc

Note that the location of nvcc might be different depending on your distribution and the installation method.

B.2.6.2 Cannot find -lcudart_static error
The default CUDA library path in the Makefile is set to /usr/local/cuda/lib64. While this is the default path for a 64-bit Ubuntu system with the CUDA toolkit installed using the package manager apt, it might be different on your system. You can manually specify the path to the CUDA library when compiling.

Example: If you did the Runfile installation on Ubuntu,

make cuda=1 CUDA_LIB=/usr/local/cuda-<toolkit-version>/lib64/

If you are using 32-bit Ubuntu,

make cuda=1 CUDA_LIB=/usr/local/cuda/lib/

Note that the location of the CUDA library path might be different depending on your distribution and the installation method.

B.2.6.3 memcpy was not declared in this scope error

If you get an error like this:

$ make cuda=1
nvcc -x cu -g -O2 -std=c++11 -lineinfo -Xcompiler -Wall -I ./htslib -DHAVE_CUDA=1 -rdc=true -c src/f5c.cu -o build/f5c_cuda.o
/usr/include/string.h: In function 'void* __mempcpy_inline(void*, const void*, size_t)':
/usr/include/string.h:652:42: error: 'memcpy' was not declared in this scope
    return (char *) memcpy (__dest, __src, __n) + __n;
Makefile:76: recipe for target 'build/f5c_cuda.o' failed
make: *** [build/f5c_cuda.o] Error 1

compile with -D_FORCE_INLINES appended to CUDA_CFLAGS when calling make:

CUDA_CFLAGS+="-D_FORCE_INLINES" make cuda=1

The issue is reported in

B.2.7 Runtime errors

B.2.7.1 CUDA driver version is insufficient for CUDA runtime version

Check the following in order:

1. Do you have an NVIDIA GPU / is your NVIDIA GPU recognised by the system? On most distributions you can use the following command to verify:

lspci | grep -i "vga\|3d\|display"

It should list the NVIDIA GPU.

2. Have you installed the NVIDIA driver (not the open-source nouveau driver)?
On most distributions you can check your graphics card driver using

lspci -nnk | grep -iA2 "vga\|3d\|display"

If the kernel driver output contains nvidia, then you are using the correct driver:

Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_396, nvidia_396_drm

3. If you are using a Tegra GPU (e.g. Jetson TX2), does the current user belong to the "video" user group? Check the current group names with groups [user].

4. Is the CUDA driver version too old for the toolkit that was used to compile with? See the CUDA binary compatibility guide. The release CUDA binary that we provide is compiled using CUDA toolkit 6.5. The CUDA runtime library is statically linked, and therefore the release CUDA binaries work on driver versions >= 340.21. Use nvidia-smi to check your driver version:

$ nvidia-smi
| NVIDIA-SMI 396.44    Driver Version: 396.44 |

If you compiled the binary yourself, see the CUDA binary compatibility guide to check whether your toolkit version and driver version match.

B.2.8 Commands and options

B.2.9 Available f5c tools

Usage: f5c <command> [options]

command:
    index              Build an index mapping from basecalled reads to the signals measured by the sequencer (same as nanopolish index)
    call-methylation   Classify nucleotides as methylated or not (optimised nanopolish call-methylation)
    meth-freq          Calculate methylation frequency at genomic CpG sites (optimised nanopolish calculate_methylation_frequency.py)
    eventalign         Align nanopore events to reference k-mers (optimised nanopolish eventalign)

B.2.9.1 Indexing

Usage: f5c index [OPTIONS] -d nanopore_raw_file_directory reads.fastq

Build an index mapping from basecalled reads to the signals measured by the sequencer. f5c index is equivalent to nanopolish index by Jared Simpson.

-h, --help                 display this help and exit
-v, --verbose              display verbose output
-d, --directory            path to the directory containing the raw ONT signal files; this option can be given multiple times
-s, --sequencing-summary   the sequencing summary file from albacore; providing this option will make indexing much faster
-f, --summary-fofn         file containing the paths to the sequencing summary files (one per line)

B.2.9.2 Calling methylation

Usage: f5c call-methylation [OPTIONS] -r reads.fa -b alignments.bam -g genome.fa

-r FILE                    fastq/fasta read file
-b FILE                    sorted bam file
-g FILE                    reference genome
-w STR                     [chr:start-end] limit processing to genomic region STR
-t INT                     number of threads [8]
-K INT                     batch size (max number of reads loaded at once) [512]
-B FLOAT[K/M/G]            max number of bases loaded at once [2.0M]
-h                         help
-o FILE                    output to file [stdout]
--iop INT                  number of I/O processes to read fast5 files [1]
--min-mapq INT             minimum mapping quality [30]
--secondary=yes|no         consider secondary mappings or not [no]
--verbose INT              verbosity level [0]
--version                  print version
--disable-cuda=yes|no      disable running on CUDA [no]
--cuda-dev-id INT          CUDA device ID to run kernels on [0]
--cuda-max-lf FLOAT        reads with length <= cuda-max-lf * avg_readlen on GPU, rest on CPU [3.0]
--cuda-avg-epk FLOAT       average number of events per k-mer, for allocating GPU arrays [2.0]
--cuda-max-epk FLOAT       reads with events per k-mer <= cuda_max_epk on GPU, rest on CPU [5.0]
-x STRING                  profile to be used for optimal CUDA parameter selection.
User-specified parameters will override profile values.

advanced options:
--kmer-model FILE          custom k-mer model file
--skip-unreadable=yes|no   skip any unreadable fast5 or terminate program [yes]
--print-events=yes|no      prints the event table
--print-banded-aln=yes|no  prints the event alignment
--print-scaling=yes|no     prints the estimated scalings
--print-raw=yes|no         prints the raw signal
--debug-break [INT]        break after processing the specified batch
--profile-cpu=yes|no       process section by section (used for profiling on CPU)
--skip-ultra FILE          skip ultra-long reads and write those entries to the bam file provided as the argument
--ultra-thresh [INT]       threshold to skip ultra-long reads [100000]
--write-dump=yes|no        write the fast5 dump to a file or not
--read-dump=yes|no         read from a fast5 dump file or not
--meth-out-version [INT]   methylation tsv output version (set 2 to print the strand column) [1]
--cuda-mem-frac FLOAT      fraction of free GPU memory to allocate [0.9 (0.7 for tegra)]

B.2.9.3 Calculate methylation frequency

Usage: meth-freq [options ...]

-c [float]                 call threshold; default is 2.5
-i [file]                  input file; read from stdin if not specified
-o [file]                  output file; write to stdout if not specified
-s                         split groups

B.2.9.4 Aligning events

Usage: f5c eventalign [OPTIONS] -r reads.fa -b alignments.bam -g genome.fa

-r FILE                    fastq/fasta read file
-b FILE                    sorted bam file
-g FILE                    reference genome
-w STR                     [chr:start-end] limit processing to genomic region STR
-t INT                     number of threads [8]
-K INT                     batch size (max number of reads loaded at once) [512]
-B FLOAT[K/M/G]            max number of bases loaded at once [2.0M]
-h                         help
-o FILE                    output to file [stdout]
--iop INT                  number of I/O processes to read fast5 files [1]
--min-mapq INT             minimum mapping quality [30]
--secondary=yes|no         consider secondary mappings or not [no]
--verbose INT              verbosity level [0]
--version                  print version
--disable-cuda=yes|no      disable running on CUDA [no]
--cuda-dev-id INT          CUDA device ID to run kernels on [0]
--cuda-max-lf FLOAT        reads with length <= cuda-max-lf * avg_readlen on GPU, rest on CPU [3.0]
--cuda-avg-epk FLOAT       average number of events per k-mer, for allocating GPU arrays [2.0]
--cuda-max-epk FLOAT       reads with events per k-mer <= cuda_max_epk on GPU, rest on CPU [5.0]
-x STRING                  profile to be used for optimal CUDA parameter selection.
User-specified parameters will override profile values.

advanced options:
--kmer-model FILE          custom k-mer model file
--skip-unreadable=yes|no   skip any unreadable fast5 or terminate program [yes]
--print-events=yes|no      prints the event table
--print-banded-aln=yes|no  prints the event alignment
--print-scaling=yes|no     prints the estimated scalings
--print-raw=yes|no         prints the raw signal
--debug-break [INT]        break after processing the specified batch
--profile-cpu=yes|no       process section by section (used for profiling on CPU)
--skip-ultra FILE          skip ultra-long reads and write those entries to the bam file provided as the argument
--ultra-thresh [INT]       threshold to skip ultra-long reads [100000]
--write-dump=yes|no        write the fast5 dump to a file or not
--read-dump=yes|no         read from a fast5 dump file or not
--summary FILE             summarise the alignment of each read/strand in FILE
--sam                      write output in SAM format
--print-read-names         print read names instead of indexes
--scale-events             scale events to the model, rather than vice versa
--samples                  write the raw samples for the event to the tsv output
--cuda-mem-frac FLOAT      fraction of free GPU memory to allocate [0.9 (0.7 for tegra)]

Appendix C

Supplementary Materials - f5c

C.1 Why Nanopolish had to be re-engineered

There are three reasons why Nanopolish had to be completely re-engineered into f5c for a successful GPU implementation.

• Nanopolish performs on-demand loading of signal data from file (a CPU thread assigned to a particular read invokes a file access just prior to signal alignment). However, transferring read by read to the GPU would incur a massive penalty, and thus a batch of reads has to be transferred at once. Thus, we had to re-write the Nanopolish processing framework in such a way that loading and processing are performed batch-wise.
In f5c, we read a batch of data into RAM and then bulk-transfer it to GPU memory, a batch of n reads at a time.

• The Nanopolish thread model is unsuitable for GPU acceleration: a thread is dynamically assigned to a read using OpenMP, so each read has its own code path. However, offloading a batch of reads to the GPU for signal alignment requires the code paths of all the reads in the batch to have converged before the GPU kernel is invoked. In addition, accurately measuring time, benchmarking and profiling of individual algorithmic components is hindered by such divergent code paths. We therefore implemented a pthread-based approach that interleaves input reading, processing and output.

• Nanopolish is not optimised for efficient resource utilisation (e.g., marginal performance improvement beyond 16 threads on servers, and heavy-weight for embedded systems due to spurious malloc calls). A comparison of such a version with the GPU would result in an apparently high speedup, which is unfair.

C.2 Additional advantages of f5c over Nanopolish

In addition to the GPU acceleration of ABEA, f5c has many additional advantages over the original Nanopolish.

• I/O and processing are interleaved in f5c: the I/O latency is considerably minimised.
• Our CPU version alone is around 1.5X-2X faster than the Nanopolish call-methylation implementation and is very lightweight (suitable for embedded systems) due to the careful use of data structures and algorithms.
• f5c is capable of detecting load balance problems between the CPU and GPU, and reports suggestions for appropriate parameters to the user.
• f5c works with a package manager's system-wide installation of HDF5 (no need for a thread-safe build of HDF5), hence no need to locally compile HDF5.
• Dependency hell has been minimised for both CPU and GPU versions. f5c is compatible with g++ 4.8 or higher, and CUDA toolkit 6.5 or higher.
• f5c has suggestive error messages for troubleshooting, especially for issues with respect to the GPU.
• The pthread-based thread framework written in C that interleaves I/O with processing is very lightweight and can be a starting point for future Nanopore tools.
• f5c allows benchmarking section by section to identify performance bottlenecks.
• The f5c framework is suitable for the acceleration of core kernels through other methods such as FPGAs.

Appendix D
Generating portable binaries for ONT tools

This appendix is based on a blog article published at https://hasindu2008.github.io/portable-binary.

Compiling software can sometimes be a nightmare due to numerous dependencies. This is specifically the case for bioinformatics tools that utilise signal-level data from Oxford Nanopore (ONT) sequencers. In my experience, the major cause behind compilation troubles in ONT tools is the Hierarchical Data Format 5 (HDF5) library.

Currently, the raw signal data from ONT sequencers are stored in the HDF5 file format. Possibly due to the complexity of HDF5, there are no alternate library implementations other than the official library from the HDF Group. Compiling the HDF5 library takes time. Luckily, package managers' versions of the HDF5 library exist, but there seem to be some inconsistencies across various distributions. Thus, software developed on one Linux distribution will rarely compile without any trouble on a different system. For example, the location of the HDF5 header file differs across distributions.

The key points for a successful portable binary are:

• avoid dependencies where possible;
• statically link libraries that are inconsistent across distributions;
• dynamically link stable, ubiquitous libraries;
• build on an old distribution so that the binary runs on newer ones; and,
• compile external libraries yourself with minimal features.

D.2 A case study with f5c

Now let's go through the above points with reference to f5c, a tool which we are currently developing that utilises ONT raw data.

1. We tried our best to avoid dependencies.
However, three external dependencies, namely HDF5, HTSlib (high-throughput sequencing library) and zlib (data compression library), and obviously the standard libraries (such as glibc and pthreads), could not be avoided. We generate the binaries for f5c on Ubuntu 14.

2. HDF5 and HTSlib are statically linked. The location of the .so file of HDF5 is not consistent across distributions, or even across different versions of the same distribution. The HTSlib that comes with the package manager is an older version, and f5c requires a newer version to support long reads. Thus, we statically link HDF5 and HTSlib. For the CUDA-supported version of f5c, we statically link the CUDA runtime library as well, which is explained later.

3. zlib and other standard libraries are dynamically linked. Executing the command ldd on a release binary of f5c (a "portable binary") gives the list of dynamically linked libraries shown below. Note that HDF5 and HTSlib were statically linked and thus are not seen in the ldd output.

    $ ldd ./f5c
        linux-vdso.so.1 =>  (0x00007fffc91fb000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f61550d0000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6154eb0000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f6154c90000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f61548f0000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f61545e0000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f61543c0000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6153fe0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6155400000)

4. We compile on a virtual machine with Ubuntu 14.

5. To highlight why we compile external libraries ourselves with minimal features, let us see the additional output of ldd when f5c was dynamically linked with the package manager's HDF5 (see below).
Observe that now, in addition to the actual HDF5 library (libhdf5_serial.so), we have two additional dependencies (libsz.so and libaec.so). Thus, if one were to statically link this package manager's HDF5 version, libsz and libaec would also have to be statically linked. Compiling HDF5 ourselves lets us drop these features which we do not want.

    $ ldd ./f5c
    ...
        libhdf5_serial.so.10 => /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10 (0x00007f1b21f30000)
        libsz.so.2 => /usr/lib/x86_64-linux-gnu/libsz.so.2 (0x00007f1b20c30000)
        libaec.so.0 => /usr/lib/x86_64-linux-gnu/libaec.so.0 (0x00007f1b20800000)
    ...

D.2.1 Note on CUDA libraries

The CUDA runtime is neither forward nor backward compatible and requires the exact version to be installed. Hence, a dynamically linked CUDA runtime is of not much use. Luckily, the CUDA runtime library has been designed to support static linking. In fact, NVIDIA recommends statically compiling the CUDA runtime library (refer to the CUDA best practices guide), and the default behaviour of the CUDA C compiler (nvcc) 5.5 or later is to statically link the CUDA runtime. However, CUDA runtimes are coupled with CUDA driver versions. NVIDIA states that the CUDA Driver API is backward compatible but not forward compatible, and thus a CUDA runtime compiled against a particular driver will work on later driver releases, but may not work on earlier driver versions. As a result, generating the binary is better done with an old CUDA toolkit version; otherwise, the users will have to install the latest drivers to run the binary. For f5c, we installed the CUDA 6.5 toolkit on the Ubuntu 14 virtual machine to generate CUDA binaries.

D.2.2 Example commands

Assume we have HDF5 and HTSlib locally compiled and the static libraries (libhdf5.a and libhts.a) located in ./build/lib/. These libraries are statically linked as:

    <gcc/g++> [options] <object1.o> <object2.o> <...> build/lib/libhdf5.a -ldl build/lib/libhts.a -lpthread -lz -o binary

To statically link the CUDA runtime when using gcc or g++:

    <gcc/g++> [options] <object1.o> <object2.o> <...> build/lib/libhdf5.a build/lib/libhts.a -L/usr/local/cuda/lib64 -lcudart_static -lpthread -lz -lrt -ldl -o binary

Alternatively, with CUDA toolkit 5.5 or higher, the NVIDIA C compiler nvcc links the CUDA runtime statically by default:

    nvcc [options] <object1.o> <object2.o> <...> build/lib/libhdf5.a build/lib/libhts.a -lpthread -lz -lrt -ldl -o binary

After generating the binary, issue the ldd command to verify that the intended libraries are statically linked. The output of ldd lists the dynamically linked libraries; the statically linked libraries should NOT appear in this output.

    ldd ./binary

Appendix E
Rock64-cluster and f5p

This appendix is based on the documentation associated with the GitHub repositories at https://github.com/hasindu2008/nanopore-cluster and https://github.com/hasindu2008/f5p.

E.1 Rock64-cluster

E.1.1 Required Hardware

A cluster of computers connected to each other using Gigabit Ethernet. We built our cluster using 16 Rock64 single board computers, and the list of items we used is:

• Rock64 single board computers (4GB RAM)
• Rock64 heat sinks
• 64 GB eMMC modules
• USB to type H barrel 5V DC power cables
• Orico DUB-8P-BK USB charging stations (as power supplies for the Rock64 devices)
• Copper cylinders for Raspberry Pi
• HPE OfficeConnect 1950 24G 2SFP+ 2XGT switch
• Ethernet cables
• USB adaptor for eMMC modules (to flash the eMMCs)

E.1.2 Connecting nodes together

• Build the cluster.
• Connect the nuts and bolts (using the copper cylinders).
• Flash Linux distributions onto the eMMCs (we flashed Ubuntu).
• Plug the eMMCs and heat sinks into the Rock64s.
• Connect the Rock64s to the switch and power supplies.
• Configure the switch.
• Assign IP addresses to the Rock64 devices.
• Provide Internet to the Rock64 devices.

E.1.3 Setting up the head node

The node which will be used to control, connect and assign work to the other worker nodes is referred to as the head node. This head node can be a Rock64 itself or any other computer; we used an old PC as the head node. On the head node you may want to do the following.

• Install and configure ansible. Ansible will be used to launch commands on all worker nodes, centrally from the head node.
• Install ansible. On Ubuntu:

    sudo apt-add-repository ppa:ansible/ansible
    sudo apt update
    sudo apt install ansible

• Configure ansible. You need to edit /etc/ansible/ansible.cfg and /etc/ansible/hosts. Our sample config files are at scripts/sample_config/ansible.
• Create an SSH key on the head node. You can use the command ssh-keygen. This key is needed for password-less access to the worker nodes.
• Mount the network attached storage. You can add an entry to /etc/fstab to persist across reboots.
• Optionally, you can install ganglia to monitor various metrics of the nodes. On Ubuntu you may use:

    sudo apt-get install ganglia-monitor rrdtool gmetad ganglia-webfrontend
    sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf

You then have to edit the configuration files /etc/ganglia/gmetad.conf and /etc/ganglia/gmond.conf. Our sample configuration files are at scripts/sample_config/ganglia. You may refer to a tutorial on installing and configuring ganglia.
• Optionally, you can configure rsyslog and LogAnalyzer to centrally view the logs through a web browser.
• Add the path of scripts/system to PATH.

E.1.4 Compiling software and preparing the folder structure

On one of your nodes (Rock64 devices):

• Compile the software. We compiled minimap2, nanopolish, samtools and f5c for the ARM architecture.
• Create a folder named nanopore under /.
• Put the compiled binaries into a folder named /nanopore/bin.
• Put the reference genome and a minimap2 index under /nanopore/reference.
• Create a folder named /nanopore/scratch for later use.

The directory structure should look like below:

    nanopore
    |__ bin
    |   |__ f5c
    |   |__ minimap2-arm
    |   |__ nanopolish
    |   |__ samtools
    |__ reference
    |   |__ hg38noAlt.fa
    |   |__ hg38noAlt.fa.fai
    |   |__ hg38noAlt.idx
    |__ scratch

E.1.5 Setting up worker nodes

On the worker node:

• Change the device name.
• Change the time zone (and configure ntp).
• Perform an apt update and package installation, e.g., nfs-common and ganglia-monitor.
• Mount the network attached storage.
• Create a swap space.
• Copy the binaries and the folder structure we constructed before.

A shell script that performs the above is available at scripts/new_workernode_setup/run_on_workernode.sh.

On the head node:

• Copy the ssh-key to the worker node.
• Copy the ganglia configuration files to the worker node.
• Copy the rsyslog configuration files to the worker node.

A shell script that performs the above is available at scripts/new_workernode_setup/run_on_headnode.sh.

E.2 f5p

f5p is a lightweight job scheduler and daemon for nanopore data processing on a nanopore mini-cluster.
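f5pl consumes a plain-text list of worker IP addresses, one address per line (data/ip_list.cfg in the f5p examples). The sketch below is not part of f5p: the check_ip_list helper is hypothetical and merely checks that each non-empty line in such a file looks like an IPv4 address, catching copy-paste mistakes before a long run is launched.

```shell
#!/bin/sh
# Hypothetical helper (not part of f5p): validate a worker IP list of the
# form f5pl expects -- one IPv4 address per line -- before starting a run.
check_ip_list() {
    file="$1"
    bad=0
    while IFS= read -r line; do
        [ -z "$line" ] && continue   # ignore blank lines
        # crude shape check: four dot-separated numeric fields
        if ! printf '%s\n' "$line" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'; then
            echo "malformed entry: $line" >&2
            bad=1
        fi
    done < "$file"
    return "$bad"
}

# Example: validate before launching the scheduler
# check_ip_list data/ip_list.cfg && ./f5pl data/ip_list.cfg data/file_list.cfg
```

The regular expression only checks the shape of each entry (it would accept 999.999.999.999); a field-wise range check could be added, but this level of checking is usually enough to catch a stray hostname or a truncated line in a cluster config.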
E.2.1 Pre-requisites

• A compute cluster composed of devices running Linux, connected to each other preferably using Ethernet.
• One of the devices acts as the head node that issues commands to the other worker nodes.
• A shared network-mounted storage for storing data.
• SSH key-based access from the head node to the worker nodes.
• Optionally, you may configure ansible to automate configuration tasks.

E.2.2 Getting started

E.2.3 Building and initial configuration

1. First, build the scheduling daemon (f5pd) and client (f5pl).

    make

2. The scheduling client (f5pl) is intended for the head node. Copy the scheduling daemon (f5pd) to all worker nodes. If you have configured ansible, you can adapt the following command.

    ansible all -m copy -a "src=./f5pd dest=/nanopore/bin/f5pd mode=0755"

3. Run the scheduling daemon (f5pd) on all worker nodes. You may want to add f5pd as a systemd service that runs at start-up. See scripts/f5pd.service for an example systemd configuration and scripts/install_f5pd_service.sh for an example installation script.

4. On the head node, create a file containing the list of IP addresses of the worker nodes, one IP address per line. An example is in data/ip_list.cfg.

5. Optionally, you may install a web server on the head node and host the scripts under scripts/front to view the log on a web browser. You will need to edit the paths in these scripts to point to the log location. Note that these scripts are probably not safe to be hosted on a public server.

E.2.4 Running for a dataset

1. Modify the shell script scripts/fast5_pipeline.sh for your use-case. This script is called on the worker nodes by f5pd, each time a data unit is assigned.
The example script:

• takes the location of a tar file on the network mount (which contains a batch of fast5 files) as the argument;
• deduces the location of the fastq file on the network mount associated with the tar file;
• copies the tar file and fastq file to the local storage;
• runs a methylation-calling pipeline that uses the tools minimap2, samtools and nanopolish; and,
• copies the results back to the network mount.

Note that this script should exit with a non-zero status if anything went wrong. After modifying the script, copy it to the worker nodes at the location /nanopore/bin/fast5_pipeline.sh.

2. On the head node, create a file containing the list of tar files (each tar file contains a fast5 batch), one tar file per line. An example is in data/file_list.cfg.

3. Launch f5pl with the IP list and the tar file list you previously created as the arguments.

    ./f5pl data/ip_list.cfg data/file_list.cfg

You may adapt the script scripts/run.sh, which performs a run as discussed above.

Appendix F
Getting Command Line Bioinformatics Tools Working on Android

This appendix is based on the blog articles published at https://hasindu2008.github.io/linux-tools-on-phone and https://hasindu2008.github.io/linux-tools-on-phone2.

This is a very hacky method and is solely for testing things out. In summary, we generate a completely statically linked binary on an ARM-based single board computer running Linux. The method is only for tools written in C/C++. I will show the steps for four examples, namely minimap2, samtools, f5c and nanopolish. The statically linked binaries which we generated can be downloaded from http://bit.ly/2INNeRv. The sample data for the following examples can be downloaded from http://bit.ly/2XOK1Yg.

F.1 Requirements

• A mobile phone running Android (does not require rooting).
My phone used for testing was a cheap LG Q6 running Android 7.
• An ARM-based single board computer (called the SBC from here onwards) running Linux. We used an Odroid XU4 running Ubuntu 16.04.4 LTS.
• A USB cable to connect your phone. Optionally, a host computer (laptop or desktop) to connect the phone to; even the SBC can be used as the host.

The sample dataset contains chromosome 22, a small set of NA12878 Nanopore reads and some E. coli Nanopore reads from the Nanopolish tutorial.

You might wonder whether the mobile phone and the SBC should have the same ARM architecture (i.e. ARMv7 or ARMv8). Not necessarily. The LG Q6 mobile phone had an ARMv8 processor (octa-core 1.4 GHz Cortex-A53) while the Odroid XU4 had ARMv7 (Cortex-A15 2 GHz and Cortex-A7, octa-core). However, the mobile phone, despite its ARMv8 64-bit processor, was still running a 32-bit version of the OS; thus, running 'cat /proc/cpuinfo' on the phone through the Android Debug Bridge (ADB) output the following (a similar outcome to that on a latest Raspberry Pi with an ARMv8 processor running the 32-bit Raspbian).

    mh:/data/local/tmp $ cat /proc/cpuinfo
    processor       : 0
    model name      : ARMv7 Processor rev 4 (v7l)
    BogoMIPS        : 38.40
    Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm aes pmull sha1 sha2 crc32
    CPU implementer : 0x41
    CPU architecture: 7
    CPU variant     : 0x0
    CPU part        : 0xd03
    CPU revision    : 4

At the time of writing, Android (tested on Android 7 and 8) seems to allow executing binaries from '/data/local/tmp' through the Android Debug Bridge (ADB). As long as this is not blocked in future versions, the method should work. In case your mobile is running a 64-bit OS, you might need an SBC running a 64-bit OS as well.

F.2 Steps

1. Set up the Android Debug Bridge (ADB).

You have to set up your host computer to be able to connect to your phone through ADB. There are a number of tutorials for this on the Internet which you can follow.
For example, see https://devsjournal.com/download-minimal-adb-fastboot-tool.html. Note that this step might vary slightly for different Android phones. This is a summary of what we did:

• Installed the minimal version of ADB. The ADB command line tool comes with the Android SDK, but we preferred the minimal version of ADB as it is lightweight. For Windows, you can download minimal ADB from https://forum.xda-developers.com/showthread.php?t=2317790. For Linux, you may use the package manager (e.g., 'sudo apt-get install android-tools-adb android-tools-fastboot').
• Installed the USB drivers for the phone. We used the OEM version (through the manufacturer website given at https://developer.android.com/studio/run/oem-usb). Even the Universal ADB driver should work for most phones.
• Enabled developer options on Android and then allowed USB debugging.
• Connected the phone through USB to the computer, opened a command line on the computer and issued the 'adb devices' command. If everything is successful, the phone connected to the computer should be listed.

    C:\Program Files (x86)\Minimal ADB and Fastboot> adb devices
    List of devices attached
    LGM70059258dab  device

If your phone is not listed (this usually happens to me due to incompatible driver or ADB versions etc.), you will have to do a bit of playing around with some patience.

2. Download the source code of the tool onto the SBC and compile with the '-static' option to generate a statically linked binary. See the examples in the next section.

3. Copy the static binary to the location '/data/local/tmp' of the mobile phone using the 'adb push' command. This location '/data/local/tmp' allows setting executable permissions and running a binary through ADB. This location works up till Android 8.1.0; hopefully it will not be restricted in future versions.
    C:\Program Files (x86)\Minimal ADB and Fastboot> adb push "/path/to/binary" /data/local/tmp/

4. Launch an 'adb shell' (which gives us a shell on the phone) and set the executable permission on the binary we just copied. Then you can execute the binary on the phone.

    C:\Program Files (x86)\Minimal ADB and Fastboot> adb shell
    mh:/ $ cd /data/local/tmp
    mh:/data/local/tmp $ chmod +x binaryname
    mh:/data/local/tmp $ ./binaryname

F.3 Examples

F.3.1 minimap2

1. First, download the minimap2 source code onto the SBC. This example uses my fork of minimap2 which was patched to support ARM. You may also use version 2.7 or higher from the original minimap2 repository at https://github.com/lh3/minimap2, which supports ARM.

    wget -O minimap2-arm.tar.gz "https://github.com/hasindu2008/minimap2-arm/archive/v0.1.tar.gz" && tar xvf minimap2-arm.tar.gz && cd minimap2-arm-0.1/

2. Open the Makefile (located inside the extracted source code directory) using a text editor and get rid of getopt.o by changing lines 35 and 36 of the Makefile from:

    minimap2: main.o getopt.o libminimap2.a
    	$(CC) $(CFLAGS) main.o getopt.o -o $@ -L. -lminimap2 $(LIBS)

to:

    minimap2: main.o libminimap2.a
    	$(CC) $(CFLAGS) main.o -o $@ -L. -lminimap2 $(LIBS)

This is to prevent a potential compilation error in the next step (i.e., a multiple definition of 'getopt' due to the one in getopt.c in the current folder and the one in libc). Note that in the latest minimap2 versions, getopt.c has been changed to ketopt and this step is not required.

3. Compile with the '-static' option by passing the 'CC' variable to Make as 'gcc -static'. You will need to have the zlib development files installed (the package manager can be used, e.g., 'sudo apt-get install zlib1g-dev').

    $ make arm_neon=1 CC="gcc -static"

4. Make sure that the generated binary is statically linked.
    $ ldd ./minimap2
        not a dynamic executable

5. Copy this binary to your mobile phone through ADB. We first copied the binary from the SBC to the laptop and then issued:

    C:\Program Files (x86)\Minimal ADB and Fastboot> adb push "C:\Users\hasindu\Desktop\minimap2" /data/local/tmp/
    C:\Users\hasindu\Desktop\minimap2: 1 file pushed. 14.0 MB/s (1470676 bytes in 0.100s)

6. Provide executable permissions and launch minimap2 without arguments on your phone to see the usage message.

    C:\Program Files (x86)\Minimal ADB and Fastboot> adb shell
    mh:/ $ cd /data/local/tmp
    mh:/data/local/tmp $ ls -l minimap2
    -rw-rw-rw- 1 shell shell 1470676 2019-06-15 17:31 minimap2
    mh:/data/local/tmp $ chmod +x minimap2
    mh:/data/local/tmp $ ./minimap2
    Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
    ....

7. Copy the reference and the reads (chr22.fa and 740475-67.fastq in our test dataset, available at http://bit.ly/2XOK1Yg) to /sdcard/genome/ on the phone. You can use 'adb push' or the Windows Explorer based phone browser.

8. Now align some reads to the reference. We ran with 4 threads instead of 8 as the phone otherwise got laggy. The '-K5M' option limits the batch size to cap the peak memory (my phone had only 3GB of RAM). Note that the chr22 reference is small and fits adequately into 2GB of RAM. If you want to align to a full human genome on a limited-memory system, see chapter 4 and appendix A.
    [M::mm_idx_gen::8.923*0.99] collected minimizers
    [M::mm_idx_gen::10.035*1.27] sorted minimizers
    [M::main::10.035*1.27] loaded/built the index for 1 target sequence(s)
    [M::mm_mapopt_update::10.394*1.26] mid_occ = 136
    [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0;
    [M::mm_idx_stat::10.617*1.26] distinct minimizers: 4817802 (89.47% are singletons); average occurrences: 1.368; average spacing: 7.784
    [M::worker_pipeline::18.912*2.37] mapped 493 sequences
    [M::worker_pipeline::52.854*2.01] mapped 413 sequences
    [M::worker_pipeline::64.470*2.33] mapped 443 sequences
    [M::worker_pipeline::109.708*2.07] mapped 457 sequences
    [M::worker_pipeline::151.487*2.32] mapped 454 sequences
    [M::worker_pipeline::162.448*2.42] mapped 317 sequences
    [M::worker_pipeline::174.190*2.52] mapped 410 sequences
    [M::worker_pipeline::183.692*2.59] mapped 496 sequences
    [M::worker_pipeline::190.814*2.63] mapped 301 sequences
    [M::main] Version: 2.11-r797
    [M::main] CMD: ./minimap2 -x map-ont -a -t4 -K5M /sdcard/genome/chr22.fa /sdcard/genome/740475-67.fastq
    [M::main] Real time: 190.969 sec; CPU: 501.840 sec
    mh:/data/local/tmp $ ls -l /sdcard/genome/740475-67.fastq
    -rw-rw---- 1 root sdcard_rw 85784776 2018-06-29 19:39 /sdcard/genome/740475-67.fastq

F.3.2 Samtools

1. Download the samtools source code.

    wget -O samtools.tar.gz "https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2" && tar -xvf samtools.tar.gz && cd samtools-1.9/

2. Compile with '-static'. You need to have the dependencies installed, or else disable unwanted components through flags to ./configure. See the official Samtools installation documentation.

    ./configure CC="gcc -static" --without-curses
    make

3. Verify that it is statically linked.

    $ ldd ./samtools
        not a dynamic executable

4. Copy the binary to your phone.
    C:\Program Files (x86)\Minimal ADB and Fastboot> adb push "C:\Users\hasindu\Desktop\samtools" /data/local/tmp/
    C:\Users\hasindu\Desktop\samtools: 1 file pushed. 9.3 MB/s (4859024 bytes in 0.496s)

5. Set executable permissions and run. The output from minimap2 above (reads.sam) is sorted and then indexed in the example below.

    C:\Program Files (x86)\Minimal ADB and Fastboot> adb shell
    mh:/ $ cd /data/local/tmp/
    mh:/data/local/tmp $ chmod +x samtools
    mh:/data/local/tmp $ ./samtools sort /sdcard/genome/reads.sam > /sdcard/genome/reads.bam
    mh:/data/local/tmp $ ./samtools index /sdcard/genome/reads.bam

Figure F.1: CPU and RAM usage ((a) CPU usage; (b) RAM usage)

F.3.3 f5c

1. Download the source code and compile statically as follows. Library compilation will take time; bear with patience.

    wget -O f5c.tar.gz https://github.com/hasindu2008/f5c/releases/download/v0.1-beta/f5c-v0.1-beta-release.tar.gz && tar xvf f5c.tar.gz && cd f5c-v0.1-beta/
    scripts/install-hts.sh
    scripts/install-hdf5.sh
    ./configure --enable-localhdf5
    make CXX="g++ -static"

2. Copy the binary to the phone as in the previous examples. Also, copy a set of Nanopore data including fast5 files (ecoli_2kb_region in our test dataset available at http://bit.ly/2XOK1Yg). Then index and perform methylation calling using f5c as below.

    1|mh:/data/local/tmp $ ./f5c index -d /sdcard/genome/ecoli_2kb_region/fast5_files/ /sdcard/genome/ecoli_2kb_region/reads.fasta
    [readdb] indexing /sdcard/genome/ecoli_2kb_region/fast5_files/
    [readdb] num reads: 112, num reads with path to fast5: 112
    1|mh:/data/local/tmp $ ./f5c call-methylation -r /sdcard/genome/ecoli_2kb_region/reads.fasta -g /sdcard/genome/ecoli_2kb_region/draft.fa -b /sdcard/genome/ecoli_2kb_region/reads.bam > /sdcard/genome/ecoli_2kb_region/ref.tsv
    [meth_main::1.595*0.98] 125 Entries (0.7M bases) loaded
    [pthread_processor::11.151*6.09] 125 Entries (0.7M bases) processed
    [meth_main] total entries: 125, qc fail: 0, could not calibrate: 0, no alignment: 0, bad fast5: 0
    [meth_main] total bases: 0.7 Mbases
    [meth_main] Data loading time: 1.419 sec
    [meth_main]     - bam load time: 0.021 sec
    [meth_main]     - fasta load time: 0.353 sec
    [meth_main]     - fast5 load time: 1.041 sec
    [meth_main]         - fast5 open time: 0.195 sec
    [meth_main]         - fast5 read time: 0.818 sec
    [meth_main] Data processing time: 9.555 sec
    [main] CMD: ./f5c call-methylation -r /sdcard/genome/ecoli_2kb_region/reads.fasta -g /sdcard/genome/ecoli_2kb_region/draft.fa -b /sdcard/genome/ecoli_2kb_region/reads.bam
    [main] Real time: 11.417 sec; CPU time: 68.170 sec; Peak RAM: 0.143 GB

F.3.4 Nanopolish

1. Download the source code and compile statically as follows. Library compilation will take time; bear with patience. This example uses my fork of nanopolish patched for ARM support. You may also use v0.11.0 or higher from the original nanopolish repository at https://github.com/jts/nanopolish, which supports ARM.

    git clone --recursive https://github.com/hasindu2008/nanopolish-arm && cd nanopolish-arm
    git checkout v0.1
    make -j8
    make clean
    make CC="gcc -static" CXX="g++ -static"

2. Copy the binary to the phone as in the previous examples. Then launch nanopolish.

    1|mh:/data/local/tmp $ ./nanopolish index -d /sdcard/genome/ecoli_2kb_region/fast5_files/ /sdcard/genome/ecoli_2kb_region/reads.fasta
    1|mh:/data/local/tmp $ ./nanopolish variants -r /sdcard/genome/ecoli_2kb_region/reads.fasta -b /sdcard/genome/ecoli_2kb_region/reads.bam -g /sdcard/genome/ecoli_2kb_region/draft.fa -t4 -w "tig00000001:200000-202000" -p1 > /sdcard/genome/ecoli_2kb_region/variants.vcf
    [post-run summary] total reads: 101, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: 0, bad fast5: 0
    1|mh:/data/local/tmp $ ./nanopolish call-methylation -r /sdcard/genome/ecoli_2kb_region/reads.fasta -g /sdcard/genome/ecoli_2kb_region/draft.fa -b /sdcard/genome/ecoli_2kb_region/reads.bam > /sdcard/genome/ecoli_2kb_region/ref.tsv
    [post-run summary] total reads: 143, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: 0, bad fast5: 0

F.4 Running Directly on the Phone

The section above shows how Linux command line bioinformatics tools (such as minimap2) can be run on an Android mobile phone through the Android Debug Bridge. That method required us to issue commands to the phone from the host PC via USB. This section shows how we can make it a bit fancier, by issuing commands directly from the mobile phone. In summary, we will install a virtual terminal app on the phone and issue commands from there. This section assumes that the binaries have already been copied to '/data/local/tmp' on your mobile phone by following the steps in the previous section.

F.4.1 On Android 7.0 or before

• Install a terminal emulator on your Android phone, for instance, Terminal Emulator for Android.
• Launch the terminal emulator app.
• On the terminal emulator, append '/system/xbin' to 'PATH' (the location of tools such as 'cp'; this might vary on your phone). Then change the current directory to home, copy the binary, give it executable permission and launch the tool. An example for minimap2 is below (see Fig. F.2).

    export PATH=/system/xbin:$PATH && cd ~
    cp /data/local/tmp/minimap2 .
    chmod +x minimap2
    ./minimap2

F.4.2 On Android 8.x

The above method, unfortunately, will not work on the latest Android 8. You may get a "Bad System Call" error when you attempt to run a binary using the terminal emulator.
This is due to the seccomp filter introduced in Android 8.0. If you have a rooted phone, you can surely get over this by running as sudo. But luckily, there is still a way for non-rooted phones: use an app that emulates the ADB client, for instance, Android Remote Debugger. The limitation of this method is that you need a host PC (with ADB configured) to initially launch the ADB server on the phone.

1. Install Android Remote Debugger on your phone.

2. Connect the phone through USB to the host computer (ADB needs to be configured as we did in the previous section) and on a command prompt issue the following.

    C:\Program Files (x86)\Minimal ADB and Fastboot> adb tcpip 5555
    restarting in TCP mode port: 5555

You can disconnect from the computer after launching the server as above. However, you will need to perform this step every time you reboot your phone.

3. Launch the Android Remote Debugger app and connect to localhost (127.0.0.1) on port 5555 (Fig. F.3).

4. Now change directory to '/data/local/tmp' and execute the binary (Fig. F.4).

F.4.3 Is there a proper way?

All the methods above are hacky and suitable only in a development environment. While I have not investigated proper ways myself, here are some thoughts.

• Compile the binaries and link against bionic, the standard C library for Android (as opposed to static linking). We have to use a cross compiler for this, i.e., gcc-arm-linux-androideabi. However, the dependencies (such as zlib) have to be compiled ourselves using the cross compiler (we cannot use the versions from apt). In addition, further requirements such as mandated position-independent executables and restrictions on text relocations will further complicate the compilation.
After getting it compiled, you would make an Android application that acts as a wrapper that calls the compiled binaries, for instance, as suggested at https://stackoverflow.com/questions/5583487/hosting-an-executable-within-android-application.
• The most proper way (but a lot of work for sure) would be to use the Android NDK to compile the C code into native libraries (which might require restructuring the source code), which can then be called from an Android app through JNI.

Figure F.2: Executing Minimap2 using the terminal emulator. (a) Entering commands; (b) Minimap2 execution; (c) SAM output.
Figure F.3: Remote ADB
Figure F.4: Execution using remote ADB

Appendix G: Open-source Contributions

G.0.1 User comments for f5c

"I have just had the first sample finish after placing them on faster storage. I have to say the speed has left me speechless and in shock. I was expecting an improvement in speed, but this is something much more than an "improvement". With iops at 16 and the drives on faster disks it took just 13 hours for a 40x human sample. That is really impressive!" — a f5c user
GitHub issue "Matching format to nanopolish" (hasindu2008/f5c), opened by LizzieMcDizzie on 6 Mar and later closed.

LizzieMcDizzie commented on 6 Mar: "Hi and thank you for such a useful tool. I was wondering if you would be able to make the headings of the outputs match.

f5c headings in tsv:
chromosome start end read_name log_lik_ratio log_lik_methylated log_lik_unmethylated num_calling_strands num_cpgs sequence

Nanopolish headings in tsv:
chromosome strand start end read_name log_lik_ratio log_lik_methylated log_lik_unmethylated num_calling_strands num_motifs sequence

Nanopolish freq table:
chromosome start end num_motifs_in_group called_sites called_sites_methylated methylated_frequency group_sequence

f5c freq table:
chromosome start end num_cpgs_in_group called_sites called_sites_methylated methylated_frequency group_sequence

It would be great if these were interchangeable. Thanks again for the great tool. Cheers."

hasindu2008 commented on 9 Mar: "Thanks for reporting this and sorry for the slow response."
G.0.2 Contributions to Minimap2

Pull request "minimap2 on ARM processors" (lh3/minimap2, +1,712 −0), merged by lh3 on 17 Dec 2017.

hasindu2008 commented on 16 Dec 2017: "A workaround to get minimap2 working on ARM processors using their NEON SIMD instructions. The headers that convert SSE to NEON are in sse2neon/emmintrin.h. Some options were added to the makefile; can be compiled for ARM with make arm_neon=1; tested on Odroid XU4 and Raspberry Pi 3." (commit: added support for arm neon, 8995e2e)

lh3 commented on 17 Dec 2017: "Thanks a lot. I like the way the changes were made. I don't have arm-based machines, so I am unable to test it myself. I will trust you on this."

Pull request "added support for 64 bit ARM architectures" (lh3/minimap2, +8 −4), merged by lh3 on 20 Jun 2018.

hasindu2008 commented on 11 Jun 2018: "Added support for ARM 64 architectures such as ARMv8." (commit: added support for 64 bit ARM architectures, 8bc2a83)

lh3 commented on 20 Jun 2018: "Thank you!"
Minimap2-2.12 (r827, tag v2.12, a5eafb7) was released by lh3 on 7 Aug 2018. Changes to minimap2: added option --split-prefix to write proper alignments (correct mapping quality and clustered query sequences) given a multi-part index (@hasindu2008); fixed a memory leak when option -y is in use. Changes to mappy: support the MD/cs tag.

G.0.3 Contributions to Nanopolish

Pull request "fix too many open files due to a missing file close" (jts/nanopolish, +3 −0), opened by hasindu2008 on 8 Feb 2018 and merged by jts on 9 Feb 2018 (commit ecf5aee).

jts commented on 9 Feb 2018: "Thanks"

Pull request "Performance improvements to nanopolish call-methylation" (jts/nanopolish, +15 −9), merged by jts on 2 Mar 2018.

hasindu2008 commented on 1 Mar 2018:
1. Changing the scheduling policy in the bam processor to dynamic
2. An option for increasing the batch size in nanopolish call-methylation (new option K)
3.
Changing the default batch size in nanopolish call-methylation (K) to 512.

With the previous scheduling policy (static scheduling) for 8 threads:
Percent of CPU this job got: 347%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:32.16

After changing the scheduling policy (now dynamic scheduling) for 8 threads:
Percent of CPU this job got: 657%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:08.48

After adding an option to increase the batch size, now run with a K=4096 batch size for 8 threads:
Percent of CPU this job got: 760%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:55.87

How the performance varies with the number of threads is below.

hasindu2008 added 3 commits on 22 Feb 2018: changed the omp scheduling in bamprocessor to dynamic instead of static (c0320e1); added batchsize to nanopolish (576881f); change the default batchsize in call-methylation to 512 (a91f2b7).

jts commented on 2 Mar 2018: "Great work, thank you!"

Pull request "Removing unnecessary parts in scrappie to get nanopolish working on ARM" (jts/nanopolish, +120 −1,210), merged by jts on 4 Jan 2019.

hasindu2008 commented on 7 Jun 2018: "The extracted code from Scrappie contains certain parts with Intel SSE instructions. However, nanopolish does not use those parts at all. Hence, they were removed. Now Nanopolish should compile on non-Intel systems."
hasindug and others added 4 commits on 16 Dec 2017: removed unnecessary parts in scrappie to get compiled on arm (1e1e77a); Merge branch 'fast5_rewrite' (d4bad59); Merge remote-tracking branch 'upstream/master' (f8e473d); comment to makefile (418bbc1).

junaruga commented on 16 Sep 2018: "I am also interested in running nanopolish on ARM. But which kind of use case for you? Is it for embedded device or HPC (Super computer)? 2 months ago, I had a chance to talk with a person who was working on bio tools in an ARM HPC team. (I am not working in the ARM company.) The team has managed a patch for ARM manually on their wiki page. That's not a good situation. The patches can be merged to the actual project. The below case is for bowtie2; nanopolish is not included in the package list. But I wish that bio tools care about ARM. https://gitlab.com/arm-hpc/packages/wikis/packages/bowtie2"

junaruga mentioned this pull request on 29 Sep 2018 ("Adding ARM test on CI") and commented: "I opened the ticket for the proposal to start ARM test on CI."

junaruga commented on 22 Oct 2018: "@hasindu2008 could you tell us the output of uname -a to know your ARM environment?"
hasindu2008 commented on 22 Oct 2018.

Pull request "One dimensional array for faster multithreaded performance of call-methylation" (jts/nanopolish, +53 −17), merged by jts on 16 Oct 2018.

hasindu2008 commented on 15 Oct 2018: "Benchmark was done for the call-methylation module using data at https://nanopolish.readthedocs.io/en/latest/quickstart_call_methylation.html using the attached script. The final answer with and without modifications are the same as verified using diff. (attachment: script.sh.txt)" (commit: one dimensional array patch for faster multithreaded performance, 0ef7c25)

jts commented on 16 Oct 2018: "Hi @hasindu2008, thanks for the PR! I've verified the improved performance with a local benchmark on a 32 core server (all times are elapsed wall clock):

nanopolish head:
8 threads: 5:02
16 threads: 3:21
24 threads: 2:50
32 threads: 3:05

your version:
8 threads: 4:33
16 threads: 2:58
24 threads: 2:21
32 threads: 2:09

I'm going to run a polishing test to make sure I get the same results, then I'll merge. Jared"

jts commented on 16 Oct 2018: "Merged, thanks again!"
Appendix H: Supplementary Materials - Optimisation of Nanopore Sequence Analysis for Many-core CPUs

H.1 Extended Motivational Example on Another System

To further demonstrate the resource usage inefficiency, we executed the methylation calling tool Nanopolish [104] on another high-end server with 28 Intel Xeon cores (56 logical cores or hyperthreads) and a Redundant Array of Independent Disks (RAID) storage composed of multiple Non-Volatile Memory Express (NVMe) Solid-State Drives (SSDs). See system S4 in Table H.2 for full information on the server used. The graph in Fig. H.1 plots the runtime for Nanopolish (left y-axis) when run with a different number of threads (x-axis). The graph also plots the CPU usage for each case on the right y-axis, where CPU utilisation is calculated as explained in the experimental setup.

Figure H.1: Variation of the runtime of original Nanopolish with the number of threads

At four threads, the CPU utilisation was close to 100%, meaning that 4 CPU cores were fully used. However, when called with 56 threads, the CPU utilisation was 30%, meaning that out of the 56 CPU cores, only around 20 CPU cores were really used. This observation confirms that procuring a server with a higher number of CPU cores would not be beneficial for similar cases.

To verify the above-mentioned limitation of the HDF5 library, we manually split the dataset into 14 roughly equal parts and then separately (but in parallel) launched 14 Nanopolish processes. Each process was launched with 4 threads; thus, the 56 threads are expected to be active to be able to fully utilise all 56 cores.
This 4-thread and 14-process configuration was selected since at 4 threads the CPU utilisation was close to 100% (more accurately, four virtual (logical) cores, as Intel processors employ hyperthreading). Each process has its own address space, and synchronisation primitives in one process do not affect the other processes. How the CPU utilisation (calculated as in equation 7.1 out of all 56 CPU threads) varied with time is shown in Fig. H.2.

Figure H.2: CPU utilisation over the runtime

Observe that the CPU utilisation is close to 100% up to around 30 minutes. After 30 minutes, the drop in CPU utilisation is due to certain processes finishing earlier than the others. Nevertheless, this plot shows that the underlying disk system could serve data faster than the CPU could process it, and confirms that the low CPU utilisation when running one Nanopolish process with 56 threads was due to the limitation of the underlying software (the HDF5 library in this case).

H.2 Extended Deeper Analysis of the Bottleneck using Another Dataset

We restructured Nanopolish such that the time spent on I/O and processing can be separately measured: a batch of reads (several thousand) is read from the disk using a single thread (to mimic the behaviour of the lock in HDF5), and the batch is then assigned to multiple threads (equal to the maximum number of cores) to be processed. This repeats until all reads are processed. On system S1 (Table 7.4) for the dataset D2 (Table H.1), the execution time was 72.19 hours (3 days!). Data loading contributed a striking 95.96% of the total time (69.27 h), with only 4.04% (2.92 h) for processing. Out of the time for data loading, reading of FAST5 files (HDF5) contributed 90.54% (62.72 h). Random access
Random accessto FASTQ/FASTA files performed using faidx in htslib took 8.58% (5.95) and sequential accessto the BAM file performed through htslib took only 0.88% (0.58h).Instead of a single thread performing FAST5 reading, now we ran the experiment with mul-tiple threads to further consolidate our explanation of the HDF5 bottleneck. Dataset D1(manageable runtime than D2 for extensive testing) was used and the experiment was con-ducted on two systems, S1 that contains an HDD RAID and S2 with an SSD RAID (referto Tables 7.4, 7.3 and H.1 for detailed information of the systems and the datasets). Asexpected, the time for reading FAST5 did not improve with the number of I/O threads asshown in the Fig. H.3. On the system with SSD RAID FAST5 access time even got worsewith multiple threads. Similar to the dataset D2 above, processing time (performed using allavailable cores), FASTA access and BAM access (performed using 1 thread) took significantlylower time compared to FAST5 access for D1 as exemplified in Fig. H.3.331PPENDIX H. APPENDIX: I/O OPTIMISATIONS T i m e ( h o u r s ) Number of I/O threadsBAM access FASTA access FAST5 access Processing Breakdown of time (a) on system S1 comprising HDD RAID 345 1 2 4 8 16 32 64 T i m e ( h o u r s ) Number of I/O threads BAM access FASTA access FAST5 access Processing Breakdown of time (b) on system S2 comprising SSD RAID Figure H.3: Inefficiency of multi-threaded I/O to the HDF5 libraryTable H.1: Dataset D2 . D2 is an Oxford Nanopore PromethION dataset. D2 was only usedfor a limited number of experiments due to the massive execution time (up to 72 hours). ID Sample No. ofGbases No. 
ofreads averagereadlength maxreadlength FASTQfilesize FAST5filesize D2 NA12878 66.093 23 654 340 2794.13 1 034523 127GB 1.6TBTable H.2: System used for experiment in section H.1 ID Description CPU CPUcores RAM DiskSystem RAIDconfig-uration OS S4 server withSSD RAIDarray Intel XeonCPU E5-2680 28 512 GB 10NVMedrives RAID10 Centos 7 H.3 Extended Results H.3.1 Results from Alternate File Format (SLOW5) for Another Dataset For the large D2 dataset we ran with all cores available on server S1 and the overall executiontime improved to 22.04 hours for restructured Nanopolish with SLOW5 which was 69.29 hours332.3. EXTENDED RESULTSpreviously for FAST (mentioned in section H.2). The FAST5 access time which was around62 hours earlier (mentioned in section H.2) improved to around 6 hours when our SLOW5format was used. H.3.2 Impact of Proposed Solutions on Disk IOPS System resource utilisation statistics for original Nanopolish , optimised Nanopolish with SLOW5 format and optimised Nanopolish with FAST5 multi-process pool are in Fig. H.4, Fig. H.5and Fig. H.6, respectively. The statistics were collected using the collectl utility in Linuxwhile each application was executing with 32 threads on System S1.Observe that disk Input/output operations per second (IOPS) for SLOW5 format is lesserthan that for the FAST5 multi-process pool while the disk I/O (amount of data read persecond) for SLOW5 format is larger than for the FAST5 multi-process pool. This observationdemonstrates the efficacy of SLOW5 format that exploits the locality in data for efficientdisk access. Storing all the data and metadata associated with a genomic-read in a singlecontiguous record reduces the random disk access operations (IOPS) while increasing theamount of data read per second. 333PPENDIX H. 
Figure H.4: Statistics collected using collectl for Nanopolish (CPU utilisation, memory, swap, disk IOPS and disk I/O against time)
Figure H.5: Statistics collected using collectl for SLOW5
Figure H.6: Statistics collected using collectl for the FAST5 multi-process pool

Appendix I: Poster Presentations

The poster presented at the Australasian Genomic Technologies Association (AGTA) Conference 2019, which won the best student poster award, is in Fig. I.1. The poster presented at the ACM SRC at ESWEEK 2019, which was shortlisted to the next level in the competition, is in Fig. I.2.

Poster: "Portable Real-time Genomic Data Processing: Harmonising Bioinformatics Software to Exploit Hardware" — Hasindu Gamaarachchi, Sri Parameswaran, Martin A. Smith (Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney; School of Computer Science and Engineering, UNSW, Sydney; St Vincent's Clinical School, UNSW, Sydney).
(The poster covers: introduction and motivation; real-time processing capability and results on different hardware, including a sub-$1,000 Rock64 cluster and NVIDIA Jetson TX2; partitioned reference indexes for Minimap2 that reduce RAM requirements without accuracy loss; f5c's CPU-GPU optimised banded signal alignment, up to ~9X faster with ~6X less peak memory than Nanopolish; future work; and the take-home message that alignment and signal-to-sequence analysis are now feasible on cost-effective, portable electronics, enabling point-of-care DNA testing.)

Figure I.1: Poster presented at AGTA 2019

Poster: "Real-time, Portable and Lightweight Nanopore DNA Sequence Analysis using System-on-Chip" — Hasindu Gamaarachchi, Martin A. Smith, Sri Parameswaran (School of Computer Science and Engineering, UNSW, Sydney; Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney; St Vincent's Clinical School, UNSW, Sydney).

(The poster covers: precision medicine motivation; methodology of analysing the workflow, identifying the nature of workloads and optimising bottlenecks using knowledge of both biological data and computer architecture; memory-, I/O- and compute-intensive optimisations including GPU parallelisation and runtime CPU-GPU workload balancing; and results showing real-time processing capability on a Jetson SoC with no impact on accuracy.)
Figure I.2: Poster presented at ACM SRC at ESWEEK 2019

References

[1] H. Chial, "DNA sequencing technologies key to the Human Genome Project," Nature Education; Nature Reviews Genetics, vol. 17, no. 9, p. 507, 2016.
[8] N. J. Schork, "Personalized medicine: time for one-person trials," Nature, vol. 520, no. 7549, pp. 609–611, 2015.
[9] G. S. Ginsburg and J. J. McCarthy, "Personalized medicine: revolutionizing drug discovery and patient care," TRENDS in Biotechnology, vol. 19, no. 12, pp. 491–496, 2001.
[10] A. T. Shaw, B. Solomon, and M. M. Kenudson, "Crizotinib and testing for ALK," Journal of the National Comprehensive Cancer Network; et al., "The diploid genome sequence of an individual human," PLoS Biology, vol. 5, no. 10, p. e254, 2007.
[13] A. J. de Koning, W. Gu, T. A. Castoe, M. A. Batzer, and D. D. Pollock, "Repetitive elements may comprise over two-thirds of the human genome," PLoS Genetics, vol. 7, no. 12, p. e1002384, 2011.
[14] T.
J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics; Nature, vol. 530, no. 7589, p. 228, 2016.
[17] N. R. Faria, E. C. Sabino, M. R. T. Nunes, L. C. J. Alcantara, N. J. Loman, and O. G. Pybus, "Mobile real-time surveillance of Zika virus in Brazil," Genome Medicine, vol. 8, no. 1, p. 97, 2016. [Online]. Available: https://doi.org/10.1186/s13073-016-0356-2
[18] J. Goordial, I. Altshuler, K. Hindson, K. Chan-Yam, E. Marcolefas, and L. G. Whyte, "In situ field sequencing and life detection in remote (79 26' N) Canadian high arctic permafrost ice wedge microbial communities," Frontiers in Microbiology, vol. 8, p. 2594, 2017.
[19] S. L. Castro-Wallace, C. Y. Chiu, K. K. John, S. E. Stahl, K. H. Rubins, A. B. R. McIntyre, J. P. Dworkin, M. L. Lupisella, D. J. Smith, D. J. Botkin, and others, "Nanopore DNA sequencing and genome assembly on the International Space Station," Scientific Reports, vol. 7, no. 1, p. 18022, 2017.
[20] NanoporeTech, "Novel Coronavirus (COVID-19) Overview," 2020, Accessed: Jul 28, 2020. [Online]. Available: https://nanoporetech.com/covid-19/overview
[21] J. L. Hennessy and D. A. Patterson, "A new golden age for computer architecture: Domain-specific hardware/software co-design, enhanced security, open instruction sets, and agile chip development," Turing Lecture, 2018.
[22] J. Jurka, W. Bao, K. Kojima, and V. V. Kapitonov, "Repetitive elements: bioinformatic identification, classification and analysis," eLS, 2011.
[23] K. H. Miga, S. Koren, A. Rhie, M. R. Vollger, A. Gershman, A. Bzikadze, S. Brooks, E. Howe, D. Porubsky, G. A. Logsdon et al., "Telomere-to-telomere assembly of a complete human X chromosome," bioRxiv, p. 735928, 2019.
[24] H. Gamaarachchi, A. Bayat, B. Gaeta, and S.
Parameswaran, “Cache Friendly Opti-misation of de Bruijn Graph based Local Re-assembly in Variant Calling,” IEEE/ACMtransactions on computational biology and bioinformatics , 2018.[25] H. Gamaarachchi, S. Parameswaran, and M. A. Smith, “Featherweight long read align-ment using partitioned reference indexes,” Scientific Reports , vol. 9, no. 1, p. 4318,2019.[26] H. Gamaarachchi, C. W. Lam, G. Jayatilaka, H. Samarakoon, J. T. Simpson, M. A.Smith, and S. Parameswaran, “GPU Accelerated Adaptive Banded Event Alignmentfor Rapid Comparative Nanopore Signal Analysis,” bioRxiv , 2019.[27] R. P. Mohanty, H. Gamaarachchi, A. Lambert, and S. Parameswaran, “SWARAM:Portable Energy and Cost Efficient Embedded System for Genomic Processing,” ACMTransactions on Embedded Computing Systems (TECS) , vol. 18, no. 5s, pp. 1–24, 2019.[28] H. Samarakoon, S. Punchihewa, A. Senanayake, R. Ragel, and H. Gamaarachchi, “F5N:Nanopore Sequence Analysis Toolkit for Android Smartphones,” bioRxiv , 2020.[29] A. Bayat, H. Gamaarachchi, N. P. Deshpande, M. R. Wilkins, and S. Parameswaran,“Methods for de-novo genome assembly,” , 2020.[30] G. Felsenfeld, “DNA.” Scientific American , vol. 253, no. 4, pp. 58–67, 1985.[31] Ensembl, “Repeats,” 2020, Accessed: Jul 28, 2020. [Online]. Available: https://asia.ensembl.org/info/genome/genebuild/assembly_repeats.html[32] A. S. Kondrashov and I. B. Rogozin, “Context of deletions and insertions in humancoding sequences,” Human mutation , vol. 23, no. 2, pp. 177–185, 2004.[33] M. Mahmoud, N. Gobet, D. I. Cruz-Dávalos, N. Mounier, C. Dessimoz, and F. J.Sedlazeck, “Structural variant calling: the long and the short of it,” Genome biology ,vol. 20, no. 1, p. 246, 2019. 34334] V. Ingram, “A specific chemical difference between the globins of normal human andsickle cell anemia hemoglobin,” Nature , vol. 178, no. Oct 13, pp. 792–794, 1956.[35] J. C. Chang and Y. W. 
Kan, “beta 0 thalassemia, a nonsense mutation in man,” Pro-ceedings of the National Academy of Sciences , vol. 76, no. 6, pp. 2886–2889, 1979.[36] C. Ober and T.-C. Yao, “The genetics of asthma and allergic disease: a 21st centuryperspective,” Immunological reviews , vol. 242, no. 1, pp. 10–30, 2011.[37] M. J. Landrum, J. M. Lee, G. R. Riley, W. Jang, W. S. Rubinstein, D. M. Church, andD. R. Maglott, “Clinvar: public archive of relationships among sequence variation andhuman phenotype,” Nucleic acids research , vol. 42, no. D1, pp. D980–D985, 2013.[38] Samtools organisation, “The Variant Call Format (VCF) Version 4.2 Specification,”2020, Accessed: Jul 28, 2020. [Online]. Available: https://samtools.github.io/hts-specs/VCFv4.2.pdf[39] L. Xu and M. Seki, “Recent advances in the detection of base modifications using thenanopore sequencer,” Journal of human genetics , pp. 1–9, 2019.[40] J. M. Heather and B. Chain, “The sequence of sequencers: the history of sequencingdna,” Genomics , vol. 107, no. 1, pp. 1–8, 2016.[41] P. J. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice, “The Sanger FASTQfile format for sequences with quality scores, and the Solexa/Illumina FASTQ variants,” Nucleic acids research , vol. 38, no. 6, pp. 1767–1771, 2009.[42] B. Ewing and P. Green, “Base-calling of automated sequencer traces using phred. ii.error probabilities,” Genome research , vol. 8, no. 3, pp. 186–194, 1998.[43] F. Sanger, G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, J. C. Fiddes, C. Hutchi-son, P. M. Slocombe, and M. Smith, “Nucleotide sequence of bacteriophage ϕ X174DNA,” nature , vol. 265, no. 5596, pp. 687–695, 1977.34444] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequencing with chain-terminatinginhibitors,” Proceedings of the national academy of sciences , vol. 74, no. 12, pp. 5463–5467, 1977.[45] L. Liu, Y. Li, S. Li, N. Hu, Y. He, R. Pong, D. Lin, L. Lu, and M. Law, “Comparison ofnext-generation sequencing systems,” BioMed research international , vol. 
2012, 2012.[46] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O.Smith, M. Yandell, C. A. Evans, R. A. Holt et al. , “The sequence of the human genome,” science Analytical biochemistry , vol. 151, no. 2, pp. 504–509, 1985.[50] HiSeqX Series of Sequencing Systems , Illumina, 2016.[51] D. I. Lou, J. A. Hussmann, R. M. McBee, A. Acevedo, R. Andino, W. H. Press, and S. L.Sawyer, “High-throughput dna sequencing errors are reduced by orders of magnitudeusing circle sequencing,” Proceedings of the National Academy of Sciences , vol. 110,no. 49, pp. 19 872–19 877, 2013.[52] E. C. Hayden, “Genome sequencing: the third generation,” Nature , vol. 457, no. 7231,pp. 768–9, 2009.[53] E. E. Schadt, S. Turner, and A. Kasarskis, “A window into third-generation sequencing,” Human molecular genetics , vol. 19, no. R2, pp. R227–R240, 2010.34554] Z.-G. Wei and S.-W. Zhang, “NPBSS: a new PacBio sequencing simulator for generatingthe continuous long reads with an empirical model,” BMC bioinformatics , vol. 19, no. 1,p. 177, 2018.[55] R. R. Wick, L. M. Judd, and K. E. Holt, “Performance of neural network basecallingtools for oxford nanopore sequencing,” Genome biology , vol. 20, no. 1, p. 129, 2019.[56] M. Jain, S. Koren, K. H. Miga, J. Quick, A. C. Rand, T. A. Sasani, J. R. Tyson, A. D.Beggs, A. T. Dilthey, I. T. Fiddes, and others, “Nanopore sequencing and assembly ofa human genome with ultra-long reads,” Nature biotechnology , vol. 36, no. 4, p. 338,2018.[57] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis,and R. Durbin, “The sequence alignment/map format and SAMtools,” Bioinformatics ,vol. 25, no. 16, pp. 2078–2079, 2009.[58] Broad Institute, “Picard,” 2018, Accessed: Jul 28, 2020. [Online]. Available:http://broadinstitute.github.io/picard/[59] A. Tarasov, A. J. Vilella, E. Cuppen, I. J. Nijman, and P. Prins, “Sambamba: fastprocessing of NGS alignment formats,” Bioinformatics , vol. 31, no. 12, pp. 
2032–2034,2015.[60] D. Sims, I. Sudbery, N. E. Ilott, A. Heger, and C. P. Ponting, “Sequencing depth andcoverage: key considerations in genomic analyses,” Nature Reviews Genetics , vol. 15,no. 2, pp. 121–132, 2014.[61] S.-M. Ahn, T.-H. Kim, S. Lee, D. Kim, H. Ghang, D.-S. Kim, B.-C. Kim, S.-Y. Kim,W.-Y. Kim, C. Kim et al. , “The first korean genome sequence and analysis: full genomesequencing for a socio-ethnic group,” Genome research , vol. 19, no. 9, pp. 1622–1629,2009. 34662] P. Flicek and E. Birney, “Sense from sequence reads: methods for alignment and as-sembly,” Nature methods , vol. 6, pp. S6–S12, 2009.[63] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compression algorithm,” Digital SRC Research Report , 1994.[64] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and callingvariants using mapping quality scores,” Genome research , vol. 18, no. 11, pp. 1851–1858, 2008.[65] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignmentprogram,” Bioinformatics , vol. 24, no. 5, pp. 713–714, 2008.[66] N. Homer, B. Merriman, and S. F. Nelson, “BFAST: an alignment tool for large scalegenome resequencing,” PloS one , vol. 4, no. 11, p. e7767, 2009.[67] P. Ferragina and G. Manzini, “Opportunistic data structures with applications,” in Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on .IEEE, 2000, pp. 390–398.[68] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficientalignment of short DNA sequences to the human genome,” Genome biology , vol. 10,no. 3, p. R25, 2009.[69] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with bowtie 2,” Naturemethods , vol. 9, no. 4, p. 357, 2012.[70] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows–Wheelertransform,” Bioinformatics , vol. 25, no. 14, pp. 1754–1760, 2009.[71] ——, “Fast and accurate long-read alignment with Burrows–Wheeler transform,” Bioin-formatics , vol. 26, no. 5, pp. 
589–595, 2010.34772] H. Li, “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM,” arXiv preprint arXiv:1303.3997 , 2013.[73] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang, “SOAP2: animproved ultrafast tool for short read alignment,” Bioinformatics , vol. 25, no. 15, pp.1966–1967, 2009.[74] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search forsimilarities in the amino acid sequence of two proteins,” Journal of molecular biology ,vol. 48, no. 3, pp. 443–453, 1970.[75] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of molecular biology , vol. 147, no. 1, pp. 195–197, 1981.[76] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: prob-abilistic models of proteins and nucleic acids . Cambridge university press, 1998.[77] K.-M. Chao, W. R. Pearson, and W. Miller, “Aligning two sequences within a specifieddiagonal band,” Bioinformatics , vol. 8, no. 5, pp. 481–487, 1992.[78] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J.Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database searchprograms,” Nucleic acids research , vol. 25, no. 17, pp. 3389–3402, 1997.[79] S. Sandmann, A. O. De Graaf, M. Karimi, B. A. Van Der Reijden, E. Hellström-Lindberg, J. H. Jansen, and M. Dugas, “Evaluating variant calling tools for non-matchednext-generation sequencing data,” Scientific Reports , vol. 7, p. 43169, 2017.[80] M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A.Philippakis, G. Del Angel, M. A. Rivas, M. Hanna et al. , “A framework for variationdiscovery and genotyping using next-generation DNA sequencing data,” Nature genetics ,vol. 43, no. 5, pp. 491–498, 2011. 34881] R. Poplin, V. Ruano-Rubio, M. A. DePristo, T. J. Fennell, M. O. Carneiro, G. A.Van der Auwera, D. E. Kling, L. D. Gauthier, A. Levy-Moonshine, D. Roazen et al. 
,“Scaling accurate genetic variant discovery to tens of thousands of samples,” BioRxiv ,p. 201178, 2018.[82] E. Garrison and G. Marth, “Haplotype-based variant detection from short-read sequenc-ing,” arXiv preprint arXiv:1207.3907 , 2012.[83] H. Li, “A statistical framework for SNP calling, mutation discovery, association mappingand population genetical parameter estimation from sequencing data,” Bioinformatics ,vol. 27, no. 21, pp. 2987–2993, 2011.[84] A. Rimmer, H. Phan, I. Mathieson, Z. Iqbal, S. R. Twigg, A. O. Wilkie, G. McVean,G. Lunter, W. Consortium et al. , “Integrating mapping-, assembly-and haplotype-basedapproaches for calling variants in clinical sequencing applications,” Nature genetics ,vol. 46, no. 8, pp. 912–918, 2014.[85] Z. Wei, W. Wang, P. Hu, G. J. Lyon, and H. Hakonarson, “SNVer: a statistical toolfor variant calling in analysis of pooled or individual next-generation sequencing data,” Nucleic acids research , vol. 39, no. 19, pp. e132–e132, 2011.[86] D. C. Koboldt, Q. Zhang, D. E. Larson, D. Shen, M. D. McLellan, L. Lin, C. A. Miller,E. R. Mardis, L. Ding, and R. K. Wilson, “VarScan 2: somatic mutation and copynumber alteration discovery in cancer by exome sequencing,” Genome research , vol. 22,no. 3, pp. 568–576, 2012.[87] C. Alkan, B. P. Coe, and E. E. Eichler, “Genome structural variation discovery andgenotyping,” Nature reviews. Genetics , vol. 12, no. 5, p. 363, 2011.[88] N. Homer and S. F. Nelson, “Improved variant discovery through local re-alignmentof short-read next-generation sequencing data using SRMA,” Genome biology , vol. 11,no. 10, p. R99, 2010. 34989] Z. Iqbal, M. Caccamo, I. Turner, P. Flicek, and G. McVean, “De novo assembly andgenotyping of variants using colored de Bruijn graphs,” Nature Genetics , vol. 44, no. 2,pp. 226–232, 2012. [Online]. Available: http://dx.doi.org/10.1038/ng.1028[90] G. VdAuwera, “Hc overview: How the haplotypecaller works,” 2014, Accessed: July8, 2017. [Online]. 
Available: https://gatkforums.broadinstitute.org/gatk/discussion/4148/hc-overview-how-the-haplotypecaller-works[91] L. Pachter, M. Alexandersson, and S. Cawley, “Applications of generalized pair hiddenmarkov models to alignment and gene finding problems,” Journal of ComputationalBiology , vol. 9, no. 2, pp. 389–399, 2002.[92] 1000 Genomes Project Consortium and others, “A map of human genome variation frompopulation scale sequencing,” Nature , vol. 467, no. 7319, p. 1061, 2010.[93] Z. Lai, A. Markovets, M. Ahdesmaki, B. Chapman, O. Hofmann, R. McEwen, J. John-son, B. Dougherty, J. C. Barrett, and J. R. Dry, “VarDict: a novel and versatile variantcaller for next-generation sequencing in cancer research,” Nucleic acids research , vol. 44,no. 11, pp. e108–e108, 2016.[94] Broad Institute, “Haplotypecaller,” 2020, Accessed: Jul 28, 2020. [Online]. Available:https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller[95] HDF Group, “HDF5 Specifications,” 2014, Accessed: Jul 28, 2020. [Online]. Available:https://support.hdfgroup.org/HDF5/doc/Specs.html[96] M. J. Chaisson and G. Tesler, “Mapping single molecule sequencing reads using basiclocal alignment with successive refinement (BLASR): application and theory,” BMCbioinformatics , vol. 13, no. 1, p. 238, 2012.[97] I. Sović, M. Šikić, A. Wilm, S. N. Fenlon, S. Chen, and N. Nagarajan, “Fast and sensitivemapping of nanopore sequencing reads with GraphMap,” Nature communications , vol. 7,p. 11307, 2016. 35098] H.-N. Lin and W.-L. Hsu, “Kart: a divide-and-conquer algorithm for NGS read align-ment,” Bioinformatics , vol. 33, no. 15, pp. 2281–2287, 2017.[99] F. J. Sedlazeck, P. Rescheneder, M. Smolka, H. Fang, M. Nattestad, A. vonHaeseler, and M. C. Schatz, “Accurate detection of complex structural variationsusing single-molecule sequencing,” Nature Methods , vol. 15, no. 6, pp. 461–468, 2018.[Online]. Available: https://doi.org/10.1038/s41592-018-0001-7[100] B. Liu, Y. Gao, and Y. 
Wang, “LAMSA: fast split read alignment with long approximatematches,” Bioinformatics , vol. 33, no. 2, pp. 192–201, 2017.[101] H. Li, “Minimap2: pairwise alignment for nucleotide sequences,” Bioinformatics , p.bty191, 2018. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/bty191[102] H. Suzuki and M. Kasahara, “Introducing difference recurrence relations for faster semi-global alignment of long sequences,” BMC bioinformatics , vol. 19, no. 1, p. 45, 2018.[103] N. J. Loman, J. Quick, and J. T. Simpson, “A complete bacterial genome assembledde novo using only nanopore sequencing data,” Nature methods , vol. 12, no. 8, p. 733,2015.[104] J. T. Simpson, R. E. Workman, P. Zuzarte, M. David, L. Dursi, and W. Timp, “Detect-ing DNA cytosine methylation using nanopore sequencing,” Nature methods , vol. 14,no. 4, p. 407, 2017.[105] NanoporeTech, “Medaka,” Accessed: Jul 28, 2020. [Online]. Available: https://github.com/nanoporetech/medaka[106] R. Luo, F. J. Sedlazeck, T.-W. Lam, and M. C. Schatz, “A multi-task convolutionaldeep neural network for variant calling in single molecule sequencing,” Nature commu-nications , vol. 10, no. 1, pp. 1–11, 2019.351107] R. Luo, C.-L. Wong, Y.-S. Wong, C.-I. Tang, C.-M. Liu, C.-M. Leung, and T.-W. Lam,“Exploring the limit of using a deep neural network on pileup data for germline variantcalling,” Nature Machine Intelligence , pp. 1–8, 2020.[108] H. Peltola, H. Söderlund, and E. Ukkonen, “SEQAID: A DNA sequence assemblingprogram based on a mathematical model,” Nucleic Acids Research , 1984.[109] D. R. Zerbino and E. Birney, “Velvet: algorithms for de novo short read assembly usingde bruijn graphs,” Genome research , vol. 18, no. 5, pp. 821–829, 2008.[110] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol, “ABySS:a parallel assembler for short read sequence data,” Genome research , vol. 19, no. 6, pp.1117–1123, 2009.[111] A. Bankevich, S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. 
M.Lesin, S. I. Nikolenko, S. Pham, A. D. Prjibelski et al. , “SPAdes: a new genome assem-bly algorithm and its applications to single-cell sequencing,” Journal of computationalbiology , vol. 19, no. 5, pp. 455–477, 2012.[112] Z. Iqbal, M. Caccamo, I. Turner, P. Flicek, and G. McVean, “De novo assembly andgenotyping of variants using colored de bruijn graphs,” Nature genetics , vol. 44, no. 2,pp. 226–232, 2012.[113] J. T. Simpson and R. Durbin, “Efficient de novo assembly of large genomes using com-pressed data structures,” Genome research , vol. 22, no. 3, pp. 549–556, 2012.[114] E. W. Myers, “The fragment assembly string graph,” Bioinformatics , vol. 21, no.suppl_2, pp. ii79–ii85, 2005.[115] H. Li, “Minimap and miniasm: fast mapping and de novo assembly for noisy longsequences,” Bioinformatics , vol. 32, no. 14, pp. 2103–2110, 2016.[116] M. Kolmogorov, J. Yuan, Y. Lin, and P. A. Pevzner, “Assembly of long, error-pronereads using repeat graphs,” Nature biotechnology , vol. 37, no. 5, pp. 540–546, 2019.352117] S. Koren, B. P. Walenz, K. Berlin, J. R. Miller, N. H. Bergman, and A. M. Phillippy,“Canu: scalable and accurate long-read assembly via adaptive k-mer weighting andrepeat separation,” Genome research , vol. 27, no. 5, pp. 722–736, 2017.[118] J. Ruan and H. Li, “Fast and accurate long-read assembly with wtdbg2,” Nature Meth-ods , vol. 17, no. 2, pp. 155–158, 2020.[119] T. Rognes, “Faster smith-waterman database searches with inter-sequence simd paral-lelisation,” BMC bioinformatics , vol. 12, no. 1, p. 221, 2011.[120] J. Daily, “Parasail: SIMD C library for global, semi-global, and local pairwise sequencealignments,” BMC bioinformatics , vol. 17, no. 1, p. 81, 2016.[121] K. Reinert, T. H. Dadi, M. Ehrhardt, H. Hauswedell, S. Mehringer, R. Rahn, J. Kim,C. Pockrandt, J. Winkler, E. Siragusa et al. , “The SeqAn C++ template library forefficient sequence analysis: A resource for programmers,” Journal of biotechnology , vol.261, pp. 157–168, 2017.[122] M. 
Zhao, W.-P. Lee, E. P. Garrison, and G. T. Marth, “SSW library: an SIMD Smith-Waterman C/C++ library for use in genomic applications,” PloS one , vol. 8, no. 12,2013.[123] A. Szalkowski, C. Ledergerber, P. Krähenbühl, and C. Dessimoz, “SWPS3–fast multi-threaded vectorized Smith-Waterman for IBM Cell/BE and × BMC researchnotes , vol. 1, no. 1, p. 107, 2008.[124] M. Vasimuddin, S. Misra, H. Li, and S. Aluru, “Efficient architecture-aware accelera-tion of BWA-MEM for multicore systems,” in . IEEE, 2019, pp. 314–324.[125] S. S. Banerjee, A. P. Athreya, L. S. Mainzer, C. V. Jongeneel, W.-M. Hwu, Z. T.Kalbarczyk, and R. K. Iyer, “Efficient and scalable workflows for genomic analyses,” in353 roceedings of the ACM International Workshop on Data-Intensive Distributed Com-puting , 2016, pp. 27–36.[126] M. Massie, F. Nothaft, C. Hartl, C. Kozanitis, A. Schumacher, A. D. Joseph, and D. A.Patterson, “ADAM: Genomics formats and processing patterns for cloud scale com-puting,” UCB/EECS-2013-207, EECS Department, University of California, Berkeley,Tech. Rep., 2013.[127] F. A. Nothaft, M. Massie, T. Danford, Z. Zhang, U. Laserson, C. Yeksigian, J. Kot-talam, A. Ahuja, J. Hammerbacher, M. Linderman, M. Franklin, A. D. Joseph, andD. A. Patterson, “Rethinking data-intensive science using scalable analytics systems,”in Proceedings of the 2015 International Conference on Management of Data (SIGMOD’15) . ACM, 2015.[128] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, “Searching for SNPswith cloud computing,” Genome biology , vol. 10, no. 11, p. R134, 2009.[129] M. C. Schatz, “CloudBurst: highly sensitive read mapping with mapreduce,” Bioinfor-matics , vol. 25, no. 11, pp. 1363–1369, 2009.[130] A. O’Driscoll, J. Daugelaite, and R. D. Sleator, “‘Big data’, Hadoop and cloud com-puting in genomics,” Journal of biomedical informatics , vol. 46, no. 5, pp. 774–781,2013.[131] M. C. Schatz, B. Langmead, and S. L. 
Salzberg, “Cloud computing and the DNA datarace,” Nature biotechnology , vol. 28, no. 7, pp. 691–693, 2010.[132] Y. Liu, W. Huang, J. Johnson, and S. Vaidya, “Gpu accelerated smith-waterman,” in International Conference on Computational Science . Springer, 2006, pp. 188–195.[133] S. A. Manavski and G. Valle, “CUDA compatible GPU cards as efficient hardware ac-celerators for Smith-Waterman sequence alignment,” BMC bioinformatics , vol. 9, no. 2,p. S10, 2008. 354134] Y. Liu, D. L. Maskell, and B. Schmidt, “CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processingunits,” BMC Research Notes , vol. 2, no. 1, p. 73, 5 2009. [Online]. Available:http://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-2-73[135] C.-M. Liu, T. Wong, E. Wu, R. Luo, S.-M. Yiu, Y. Li, B. Wang, C. Yu, X. Chu,K. Zhao et al. , “SOAP3: ultra-fast GPU-based parallel alignment tool for short reads,” Bioinformatics , vol. 28, no. 6, pp. 878–879, 2012.[136] P. Klus, S. Lam, D. Lyberg, M. S. Cheung, G. Pullan, I. McFarlane, G. S. Yeo, andB. Y. Lam, “BarraCUDA - a fast short read sequence aligner using graphics processingunits,” BMC research notes , vol. 5, no. 1, p. 27, 2012.[137] M. C. Schatz, C. Trapnell, A. L. Delcher, and A. Varshney, “High-throughput sequencealignment using graphics processing units,” BMC bioinformatics , vol. 8, no. 1, p. 474,2007.[138] R. Luo, Y.-L. Wong, W.-C. Law, L.-K. Lee, J. Cheung, C.-M. Liu, and T.-W. Lam,“BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing,accelerated by GPU,” PeerJ , vol. 2, p. e421, 2014.[139] Z. Feng, S. Qiu, L. Wang, and Q. Luo, “Accelerating long read alignment on threeprocessors,” in Proceedings of the 48th International Conference on Parallel Processing .ACM, 2019, p. 71.[140] NVIDIA, “Clara genomics,” 2020, Accessed: Jan 06, 2020. [Online]. Available:https://developer.nvidia.com/Clara-Genomics[141] K. Benkrid, Y. Liu, and A. 
Benkrid, “A highly parameterized and efficient FPGA-basedskeleton for pairwise biological sequence alignment,” IEEE Transactions on Very LargeScale Integration (VLSI) Systems , vol. 17, no. 4, pp. 561–570, 2009.355142] E. Houtgast, V.-M. Sima, and Z. Al-Ars, “High performance streaming Smith-Watermanimplementation with implicit synchronization on intel FPGA using OpenCL,” in .IEEE, 2017, pp. 492–496.[143] S. S. Banerjee, M. El-Hadedy, J. B. Lim, Z. T. Kalbarczyk, D. Chen, S. S. Lumetta,and R. K. Iyer, “Asap: Accelerated short-read alignment on programmable hardware,” IEEE Transactions on Computers , vol. 68, no. 3, pp. 331–346, 2018.[144] S. Huang, G. J. Manikandan, A. Ramachandran, K. Rupnow, W.-m. W. Hwu, andD. Chen, “Hardware Acceleration of the Pair-HMM Algorithm for DNA Variant Call-ing,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays . ACM, 2017, pp. 275–284.[145] S. S. Banerjee, M. El-Hadedy, C. Y. Tan, Z. T. Kalbarczyk, S. Lumetta, and R. K.Iyer, “On accelerating pair-HMM computations in programmable hardware,” in BMC systems biology , vol. 12, no. 5, p. 96, 2018.356150] ——, “Accelerating smith-waterman alignment of long DNA sequences with OpenCLon FPGA,” in International Conference on Bioinformatics and Biomedical Engineering .Springer, 2017, pp. 500–511.[151] V. Gnanasambandapillai, A. Bayat, and S. Parameswaran, “Mesga: An mpsoc basedembedded system solution for short read genome alignment,” in . IEEE, 2018, pp. 52–57.[152] Y. Turakhia, G. Bejerano, and W. J. Dally, “Darwin: A genomics co-processor providesup to 15,000x acceleration on long read assembly,” ACM SIGPLAN Notices , vol. 53,no. 2, pp. 199–213, 2018.[153] Reuters Events, “Celera’s paracel unit introduces genematcher 2,” 2001, Accessed:Jul 28, 2020. [Online]. Available: http://social.eyeforpharma.com/uncategorised/celeras-paracel-unit-introduces-genematcher-2[154] P. Muir, S. Li, S. Lou, D. Wang, D. J. Spakowicz, L. Salichos, J. Zhang, G. 
M.Weinstock, F. Isaacs, J. Rozowsky, and M. Gerstein, “The real cost of sequencing:scaling computation to keep pace with data generation,” Genome Biology , vol. 17,no. 1, p. 53, Mar 2016. [Online]. Available: https://doi.org/10.1186/s13059-016-0917-0[155] R. Nielsen, J. S. Paul, A. Albrechtsen, and Y. S. Song, “Genotype and SNP calling fromnext-generation sequencing data,” Nature Reviews Genetics , vol. 12, no. 6, pp. 443–451,2011.[156] S. Li, R. Li, H. Li, J. Lu, Y. Li, L. Bolund, M. H. Schierup, and J. Wang, “SOAPindel:efficient identification of indels from short paired reads,” Genome research , vol. 23, no. 1,pp. 195–200, 2013.[157] G. Narzisi, J. A. O’rawe, I. Iossifov, H. Fang, Y.-h. Lee, Z. Wang, Y. Wu, G. J. Lyon,M. Wigler, and M. C. Schatz, “Accurate de novo and transmitted indel detection in357xome-capture data using microassembly,” Nature methods , vol. 11, no. 10, pp. 1033–1036, 2014.[158] P. A. Pevzner, H. Tang, and M. S. Waterman, “An Eulerian path approach to DNAfragment assembly,” Proceedings of the National Academy of Sciences , vol. 98, no. 17,pp. 9748–9753, 2001.[159] H. Gamaarachchi, “Cache Optimised Platypus Variant Caller,” Accessed: Jul 28, 2020.[Online]. Available: https://github.com/hasindu2008/platycflr[160] C. Rauer and N. F. Jr, “Accelerating Genomics Research with OpenCL and FPGAs,”Altera Corporation, Tech. Rep. WP-01262-1.0, Mar. 2016.[161] G. VdAuwera, “Version highlights for GATK version3.8,” 2017, Accessed: Jul 28, 2020. [Online]. Avail-able: https://github.com/broadinstitute/gatk-docs/blob/master/blog-2012-to-2019/2017-07-29-Version_highlights_for_GATK_version_3.8.md?id=1006[162] F. Nothaft, “Scalable Genome Resequencing with ADAM and avocado,” Tech. ReportNo.: UCB/EECS-2015–65 , 2015.[163] M. S. Hasan, X. Wu, and L. Zhang, “Performance evaluation of indel calling tools usingreal short-read data,” Human genomics , vol. 9, no. 1, p. 20, 2015.[164] R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. 
Liu et al. , “SOAPdenovo2: an empirically improved memory-efficient short-read de novoassembler,” Gigascience , vol. 1, no. 1, p. 18, 2012.[165] R. Chikhi and G. Rizk, “Space-efficient and exact de bruijn graph representation basedon a bloom filter,” Algorithms for Molecular Biology , vol. 8, no. 1, p. 22, 2013.[166] E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick, “Parallel debruijn graph construction and traversal for de novo genome assembly,” in Proceedings f the International Conference for High Performance Computing, Networking, Storageand Analysis . IEEE Press, 2014, pp. 437–448.[167] A. Rimmer, “Platypus Variant Caller,” Accessed: Jul 30, 2020. [Online]. Available:https://github.com/andyrimmer/Platypus[168] 1000genomes, “20120117_ceu_trio_b37_decoy,” 2012, Accessed: Jul 30, 2020.[Online]. Available: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120117_ceu_trio_b37_decoy/[169] H. Mohamadi, B. P. Vandervalk, A. Raymond, S. D. Jackman, J. Chu, C. P. Breshears,and I. Birol, “DIDA: Distributed Indexing Dispatched Alignment,” PloS one , vol. 10,no. 4, p. e0126409, 2015.[170] T. H. Dadi, E. Siragusa, V. C. Piro, A. Andrusch, E. Seiler, B. Y. Renard, andK. Reinert, “DREAM-Yara: an exact read mapper for very large databases with shortupdate time,” Bioinformatics , vol. 34, no. 17, pp. i766–i772, 2018. [Online]. Available:http://dx.doi.org/10.1093/bioinformatics/bty567[171] I. W. Deveson, W. Y. Chen, T. Wong, S. A. Hardwick, S. B. Andersen, L. K. Nielsen,J. S. Mattick, and T. R. Mercer, “Representing genetic variation with synthetic DNAstandards,” Nature methods , vol. 13, no. 9, p. 784, 2016.[172] H. Li, “minimap,” 2015, Accessed: Jul 30, 2020. [Online]. Available: https://github.com/lh3/minimap/blob/master/README.md[173] D. W. Garsed, O. J. Marshall, V. D. Corbin, A. Hsu, L. Di Stefano, J. Schröder, J. Li,Z.-P. Feng, B. W. Kim, M. Kowarsky et al. , “The architecture and evolution of cancerneochromosomes,” Cancer Cell , vol. 
26, no. 5, pp. 653–667, 2014.[174] H. Li, H. Gamaarachchi, M. van den Beek, I. Kolpakov, J. Guo, martinghunt, . . . , andC. de Lannoy., “hasindu2008/minimap2-arm: long read alignment using partitioned359eference indexes (version v0.1),” github http://doi.org/10.5281/zenodo.2011136, Dec2018.[175] Y. Ono, K. Asai, and M. Hamada, “PBSIM: PacBio reads simulator—toward accurategenome assembly,” Bioinformatics , vol. 29, no. 1, pp. 119–121, 2012.[176] H. Li, “Paftools,” 2018, Accessed: Jul 30, 2020. [Online]. Available: https://github.com/lh3/minimap2/blob/master/misc/README.md[177] Y. Li, R. Han, C. Bi, M. Li, S. Wang, and X. Gao, “DeepSimulator: a deep simulatorfor Nanopore sequencing,” Bioinformatics , p. bty223, 2018. [Online]. Available:http://dx.doi.org/10.1093/bioinformatics/bty223[178] C. Yang, J. Chu, R. L. Warren, and I. Birol, “NanoSim: nanopore sequence read simu-lator based on statistical characterization,” GigaScience , vol. 6, no. 4, pp. 1–6, 2017.[179] P. C. Faucon, P. Balachandran, and S. Crook, “SNaReSim: Synthetic Nanopore ReadSimulator,” in Healthcare Informatics (ICHI), 2017 IEEE International Conference on .IEEE, 2017, pp. 338–344.[180] AdamaJava, “Adamajava,” 2018, Accessed: Jul 30, 2020. [Online]. Available:https://github.com/AdamaJava/adamajava[181] A. R. Quinlan and I. M. Hall, “Bedtools: a flexible suite of utilities for comparinggenomic features,” Bioinformatics , vol. 26, no. 6, pp. 841–842, 2010.[182] H. Gamaarachchi, S. Parasemwaran, and M. Smith, “Datasets and experiment data oflong read alignment using partitioned reference indexes,” figshare https://doi.org/10.6084/m9.figshare.6964805.v1, Dec 2018.[183] V. Liyanage, J. Jarmasz, N. Murugeshan, M. Del Bigio, M. Rastegar, and J. Davie,“DNA modifications: function and applications in normal and disease states,” Biology ,vol. 3, no. 4, pp. 670–723, 2014. 360184] J. Lewandowska and A. 
Bartoszek, “DNA methylation in cancer development, diagnosis and therapy—multiple opportunities for genotoxic agents to act as methylome disruptors or remediators,” Mutagenesis, vol. 26, no. 4, pp. 475–487, 2011.
[185] M. Fraser, V. Y. Sabelnykova, T. N. Yamaguchi, L. E. Heisler, J. Livingstone, V. Huang, Y.-J. Shiah, F. Yousif, X. Lin, A. P. Masella et al., “Genomic hallmarks of localized, non-indolent prostate cancer,” Nature, vol. 541, no. 7637, p. 359, 2017.
[186] S. Saxonov, P. Berg, and D. L. Brutlag, “A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters,” Proceedings of the National Academy of Sciences, vol. 103, no. 5, pp. 1412–1417, 2006.
[187] A. Bird, “DNA methylation patterns and epigenetic memory,” Genes & Development, vol. 16, no. 1, pp. 6–21, 2002.
[188] S. Gonzalo, “Epigenetic alterations in aging,” Journal of Applied Physiology, vol. 109, no. 2, pp. 586–597, 2010.
[189] H. Lu, F. Giordano, and Z. Ning, “Oxford Nanopore MinION sequencing and genome assembly,” Genomics, Proteomics & Bioinformatics, vol. 14, no. 5, pp. 265–279, 2016.
[190] F. J. Rang, W. P. Kloosterman, and J. de Ridder, “From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy,” Genome Biology, vol. 19, no. 1, p. 90, 2018.
[191] M. David, L. J. Dursi, D. Yao, P. C. Boutros, and J. T. Simpson, “Nanocall: an open source basecaller for Oxford Nanopore sequencing data,” Bioinformatics, vol. 33, no. 1, pp. 49–55, 2016.
[192] NVIDIA, CUDA C Programming Guide, September 2018, PG-02829-001_v10.0.
[193] ——, CUDA C Best Practices Guide, October 2018, DG-05603-001_v10.0.
[194] Y. Liu, B. Schmidt, and D. L. Maskell, “CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions,” BMC Research Notes, vol. 3, no. 1, p. 93, 2010. [Online]. Available: http://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-3-93
[195] Y. Liu, A. Wirawan, and B. Schmidt, “CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions,” BMC Bioinformatics, vol. 14, no. 1, p. 117, 2013. [Online]. Available: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-117
[196] Oxford Nanopore Technologies, “MinIT is out - an analysis and device control accessory to enable powerful, real-time DNA/RNA sequencing by anyone, anywhere,” 2018, Accessed: Jul 30, 2020. [Online]. Available: https://nanoporetech.com/about-us/news/minit-launch
[197] I. Huismann, M. Lieber, J. Stiller, and J. Fröhlich, “Load balancing for CPU-GPU coupling in computational fluid dynamics,” in International Conference on Parallel Processing and Applied Mathematics. Springer, 2017, pp. 337–347.
[198] J. Lang and G. Rünger, “Dynamic distribution of workload between CPU and GPU for a parallel conjugate gradient method in an adaptive FEM,” Procedia Computer Science, vol. 18, pp. 299–308, 2013.
[199] Oxford Nanopore Technologies, “Ligation Sequencing Kit 1D or Rapid Sequencing Kit,” 2017, Accessed: Jul 30, 2020. [Online]. Available: https://store.nanoporetech.com/media/Ligation_Sequencing_Kit_1D_or_Rapid_Sequencing_Kit_v5_Feb2017.pdf
[200] J. Simpson, “Stats and analysis,” 2017, Accessed: Jul 30, 2020. [Online]. Available: https://nanopolish.readthedocs.io/en/latest/quickstart_call_methylation.html
[201] NVIDIA, Profiler user’s guide Parallel Computing, vol. 30, no. 7, pp. 817–840, 2004.
[206] Adiscon, “LogAnalyzer,” Accessed: Jul 30, 2020. [Online]. Available: https://loganalyzer.adiscon.com
[207] M. Scholz, D. V. Ward, E. Pasolli, T. Tolio, M. Zolfo, F. Asnicar, D. T. Truong, A. Tett, A. L. Morrow, and N. Segata, “Strain-level microbial epidemiology and population genomics from shotgun metagenomics,” Nature Methods, vol. 13, no. 5, pp. 435–438, 2016.
[208] M. Riba, C. Sala, D. Toniolo, and G. Tonon, “Big data in medicine, the present and hopefully the future,” Frontiers in Medicine, vol. 6, 2019.
[209] L. Schmitt, “From computational genomics to precision medicine,” 2016, Accessed: Jul 30, 2020. [Online]. Available: https://cs.illinois.edu/news/computational-genomics-precision-medicine
[210] J. de Jesus, C. I. SC, F. Salles, E. Manulli, D. da Silva, T. de Paiva, M. Pinho, A. Afonso, A. Mathias, L. Prado et al., “First cases of coronavirus disease (COVID-19) in Brazil, South America (2 genomes, 3rd March 2020),” Virological, 2020. [Online]. Available: http://virological.org/t/first-cases-of-coronavirus-disease-covid-19-in-brazil-south-america-2-genomes-3rd-march-2020/409
[211] B. Schmidt and A. Hildebrandt, “Next-generation sequencing: big data meets high performance computing,” Drug Discovery Today, vol. 22, no. 4, pp. 712–717, 2017.
[212] A. S. Bland, W. Joubert, D. E. Maxwell, N. Podhorszki, J. H. Rogers, and A. N. Tharrington, “Contemporary high performance computing from petascale toward exascale,” Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Center for . . . , Tech. Rep., 2013.
[213] NanoporeTech, “Jared Simpson: Signal analysis using nanopolish,” 2019, Accessed: Jul 30, 2020. [Online]. Available: https://nanoporetech.com/resource-centre/jared-simpson-signal-analysis-using-nanopolish
[214] K. Chan, “What are core-hours? How are they estimated?” 2019, Accessed: Jul 30, 2020. [Online]. Available: https://support.onscale.com/hc/en-us/articles/360013402431-What-are-Core-Hours-How-are-they-estimated-
[215] J. Corbet, “Fixing asynchronous I/O, again,” Jan. 2016, Accessed: Jul 30, 2020. [Online]. Available: https://lwn.net/Articles/671649/
[216] L. Torvalds, “Re: [PATCH 09/13] aio: add support for async openat(),” Jan. 2016, Accessed: Jul 30, 2020. [Online]. Available: https://lwn.net/Articles/671657/
[217] M. Majkowski, “io_submit: The epoll alternative you’ve never heard about,” Jan. 2019, Accessed: Jul 30, 2020. [Online]. Available: https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/
[218] J. Corbet, “Toward non-blocking asynchronous I/O,” May 2017, Accessed: Jul 30, 2020. [Online]. Available: https://lwn.net/Articles/724198/
[219] man7.org, “IO_SETUP(2) Linux Programmer’s Manual,” Sep. 2019, Accessed: Jul 30, 2020. [Online]. Available: http://man7.org/linux/man-pages/man2/io_setup.2.html
[220] J. E. Moyer, “libaio,” Accessed: Jul 30, 2020. [Online]. Available: https://pagure.io/libaio
[221] M. Kerrisk, “AIO(7) Linux Programmer’s Manual,” Mar. 2019, Accessed: Jul 30, 2020. [Online]. Available: http://man7.org/linux/man-pages/man7/aio.7.html
[222] M. Jain, H. E. Olsen, B. Paten, and M. Akeson, “The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community,” Genome Biology, vol. 17, no. 1, p. 239, 2016.
[223] NanoporeTech, “Oxford Nanopore Technologies fast5 API software,” 2019, Accessed: Jul 30, 2020. [Online]. Available: https://github.com/nanoporetech/ont_fast5_api
[224] C. Rossant, “Moving away from HDF5,” 2016, Accessed: Jul 28, 2020. [Online]. Available: https://cyrille.rossant.net/moving-away-hdf5/
[225] ——, “Should you use HDF5?” 2016, Accessed: Jul 28, 2020. [Online]. Available: https://cyrille.rossant.net/should-you-use-hdf5/
[226] M. Leija-Salazar, F. J. Sedlazeck, M. Toffoli, S. Mullin, K. Mokretar, M. Athanasopoulou, A. Donald, R. Sharma, D. Hughes, A. H. Schapira et al., “Evaluation of the detection of GBA missense mutations and other variants using the Oxford Nanopore MinION,” Molecular Genetics & Genomic Medicine, vol. 7, no. 3, p. e564, 2019.
[227] J. Simpson, “Nanopolish,” 2016, Accessed: Jul 30, 2020. [Online]. Available: https://github.com/jts/nanopolish/
[228] S. R. Eddy, “Accelerated profile HMM searches,” PLoS Computational Biology, vol. 7, no. 10, 2011.
[229] R. Al-Ali, N. Kathiresan, M. El Anbari, E. R. Schendel, and T. A. Zaid, “Workflow optimization of performance and quality of service for bioinformatics application in high performance computing,” Journal of Computational Science, vol. 15, pp. 3–10, 2016.
[230] A. Kawalia, S. Motameny, S. Wonczak, H. Thiele, L. Nieroda, K. Jabbari, S. Borowski, V. Sinha, W. Gunia, U. Lang et al., “Leveraging the power of high performance computing for next generation sequencing data analysis: tricks and twists from a high throughput exome workflow,” PLoS One, vol. 10, no. 5, p. e0126321, 2015.
[231] N. Kathiresan, R. Al-Ali, P. V. Jithesh, T. AbuZaid, R. Temanni, and A. Ptitsyn, “Optimization of data-intensive next generation sequencing in high performance computing,” in . IEEE, 2015, pp. 1–6.
[232] N. Kathiresan, R. Temanni, H. Almabrazi, N. Syed, P. V. Jithesh, and R. Al-Ali, “Accelerating next generation sequencing data analysis with system level optimizations,” Scientific Reports, vol. 7, no. 1, pp. 1–11, 2017.
[233] M. H.-Y. Fritz, R. Leinonen, G. Cochrane, and E. Birney, “Efficient storage of high throughput DNA sequencing data using reference-based compression,” Genome Research, vol. 21, no. 5, pp. 734–740, 2011.
[234] S. L. Amarasinghe, S. Su, X. Dong, L. Zappia, M. E. Ritchie, and Q. Gouil, “Opportunities and challenges in long-read sequencing data analysis,” Genome Biology, vol. 21, no. 1, pp. 1–16, 2020.
[235] B. Fenski, “htop(1) - Linux man page,” Accessed: Jul 30, 2020. [Online]. Available: https://linux.die.net/man/1/htop
[236] hdfgroup.org, “Questions about thread-safety and concurrent access,” Accessed: Jul 30, 2020. [Online]. Available: https://portal.hdfgroup.org/display/knowledge/Questions+about+thread-safety+and+concurrent+access
[237] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry et al., “The variant call format and VCFtools,” Bioinformatics, vol. 27, no. 15, pp. 2156–2158, 2011.
[238] E. W. Stratford, R. Castro, J. Daffinrud, M. Skårn, S. Lauvrak, E. Munthe, and O. Myklebost, “Characterization of liposarcoma cell lines for preclinical and biological studies,” Sarcoma, vol. 2012, 2012.
[239] H. Li, “Tabix: fast retrieval of sequence features from generic tab-delimited files,” Bioinformatics, vol. 27, no. 5, pp. 718–719, 2011.
[240] M. H. Stoiber, J. Quick, R. Egan, J. E. Lee, S. E. Celniker, R. Neely, N. Loman, L. Pennacchio, and J. B. Brown, “De novo identification of DNA modifications enabled by genome-guided nanopore signal processing,” bioRxiv, p. 094672, 2016.
[241] Q. Liu, D. C. Georgieva, D. Egli, and K. Wang, “NanoMod: a computational tool to detect DNA modifications using nanopore long-read sequencing data,” BMC Genomics, vol. 20, no. 1, p. 78, 2019.
[242] J. M. Ferguson and M. A. Smith, “SquiggleKit: A toolkit for manipulating nanopore signal data,” Bioinformatics, vol. 35, no. 24, pp. 5372–5373, 2019.
[243] M. Folk, G. Heber, Q. Koziol, E. Pourmal, and D. Robinson, “An overview of the HDF5 technology suite and its applications,” in