Accelerating the Understanding of Life's Code Through Better Algorithms and Hardware Design
This thesis received the 2018 IEEE Turkey Doctoral Dissertation Award. (This copy was updated in October 2019.)
ACCELERATING THE UNDERSTANDING OF LIFE'S CODE THROUGH BETTER ALGORITHMS AND HARDWARE DESIGN

A dissertation submitted to the Graduate School of Engineering and Science of Bilkent University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering
By Mohammed H. K. Alser
June 2018

We certify that we have read this dissertation and that in our opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Can Alkan (Advisor)
Onur Mutlu (Co-Advisor)
Ergin Atalar
Mehmet Koyutürk
Nurcan Tunçbağ
Oğuz Ergin

Approved for the Graduate School of Engineering and Science:

Ezhan Karaşan
Director of the Graduate School
ABSTRACT

ACCELERATING THE UNDERSTANDING OF LIFE'S CODE THROUGH BETTER ALGORITHMS AND HARDWARE DESIGN
Mohammed H. K. Alser
Ph.D. in Computer Engineering
Advisor: Can Alkan
Co-Advisor: Onur Mutlu
June 2018

Our understanding of human genomes today is affected by the ability of modern computing technology to quickly and accurately determine an individual's entire genome. Over the past decade, high throughput sequencing (HTS) technologies have opened the door to remarkable biomedical discoveries through their ability to generate hundreds of millions to billions of DNA segments per run, along with a substantial reduction in time and cost. However, this flood of sequencing data continues to overwhelm the processing capacity of existing algorithms and hardware. To analyze a patient's genome, each of these segments, called reads, must be mapped to a reference genome based on the similarity between the read and "candidate" locations in that reference genome. The similarity measurement, called alignment and formulated as an approximate string matching problem, is the computational bottleneck because: (1) it is implemented using quadratic-time dynamic programming algorithms, and (2) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.

In this thesis, we introduce four new algorithms that function as a pre-alignment step and aim to filter out most incorrect candidate locations. We call our algorithms GateKeeper, Shouji, MAGNET, and SneakySnake.
The first key idea of our proposed pre-alignment filters is to provide high filtering accuracy by correctly detecting all similar segments shared between two sequences. The second key idea is to exploit the massively parallel architecture of modern FPGAs to accelerate our four proposed filtering algorithms. We also develop an efficient CPU implementation of the SneakySnake algorithm for commodity desktops and servers, which are widely available to bioinformaticians without the hassle of handling hardware complexity. We evaluate the benefits and downsides of our pre-alignment filtering approach in detail using 12 real datasets across different read lengths and edit distance thresholds. In our evaluation, we demonstrate that our hardware pre-alignment filters show two to three orders of magnitude speedup over their equivalent CPU implementations. We also demonstrate that integrating our hardware pre-alignment filters with state-of-the-art read aligners reduces the aligner's execution time by up to 21.5x. Finally, we show that an efficient CPU implementation of pre-alignment filtering still provides significant benefits. We show that SneakySnake reduces the execution time of the best performing CPU-based read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. The key conclusion of this thesis is that developing a fast and efficient filtering heuristic, together with a better understanding of its accuracy, leads to a significant reduction in read alignment's execution time without sacrificing any of the aligner's capabilities. We hope and believe that our new architectures and algorithms catalyze their adoption in existing and future genome analysis pipelines.
Keywords: Read Mapping, Approximate String Matching, Read Alignment, Levenshtein Distance, String Algorithms, Edit Distance, fast pre-alignment filter, field-programmable gate arrays (FPGA).
ÖZET (Turkish Abstract)

ACCELERATING THE UNDERSTANDING OF LIFE'S CODE THROUGH BETTER ALGORITHMS AND HARDWARE DESIGNS
Mohammed H. K. Alser
Ph.D. in Computer Engineering
Advisor: Can Alkan
Co-Advisor: Onur Mutlu
June 2018

Our understanding of human genomes today is affected by the ability of modern computing technology to quickly and accurately determine an individual's entire genome. Over the past decade, high throughput sequencing (HTS) technologies have opened the door to remarkable biomedical discoveries through their ability to generate hundreds of millions to billions of DNA segments per run, along with a substantial reduction in time and cost. However, this abundance of sequencing data continues to push the limits of the processing capacity of existing algorithms and hardware. To analyze a patient's genome, each of these segments, called "reads," is placed onto a reference genome based on its similarity to candidate regions in that reference. The similarity computation, formulated as an approximate string matching problem and called alignment, is a computational bottleneck because: (1) it is computed using quadratic-time dynamic programming algorithms, and (2) a large portion of the candidate regions in the reference genome differ so strongly from a given read that they cannot be aligned to it. Computing the alignability of such incorrectly identified candidate regions constitutes the majority of the running time of today's read mapping algorithms. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate regions and remove them from candidacy before running these computationally costly alignment algorithms.

In this thesis, we present four new algorithms that serve as a pre-alignment step and aim to filter out most incorrect candidate locations.
We call our algorithms GateKeeper, Shouji, MAGNET, and SneakySnake. The first key idea of the proposed pre-alignment filters is to provide high filtering accuracy by correctly detecting all similar segments shared between two sequences. The second key idea is to use the massively parallel architecture of modern FPGAs to accelerate our four proposed filtering algorithms. We developed SneakySnake primarily so that bioinformaticians can use it on readily available commodity desktops and servers without having to deal with hardware complexity. We evaluated the advantages and disadvantages of our pre-alignment filtering approach in detail using 12 real datasets with different read lengths and distance thresholds. In our evaluation, we show that our hardware pre-alignment filters are two to three orders of magnitude faster than their equivalent CPU implementations. We also show that integrating our hardware pre-alignment filters with state-of-the-art read aligners reduces the aligner's running time by up to 21.5x, depending on the edit distance threshold. Finally, we show that an efficient CPU implementation of pre-alignment filtering still provides significant benefits. We show that SneakySnake reduces the running times of the best performing CPU-based read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. The main conclusion of this thesis is that developing a fast and efficient filtering mechanism, together with a better understanding of its accuracy, leads to a significant reduction in read alignment's running time without sacrificing any of the aligners' capabilities. We hope and believe that our new architectures and algorithms will catalyze their adoption in existing and future genome analysis pipelines.
Keywords: Read Mapping, Approximate String Matching, Read Alignment, Levenshtein Distance, String Algorithms, Edit Distance, fast pre-alignment filter, field-programmable gate arrays (FPGA).

Acknowledgement
The last nearly four years at Bilkent University have been the most exciting and fruitful time of my life, not only in the academic arena but also on a personal level. For that, I would like to seize this opportunity to acknowledge all the people who have supported and helped me become who I am today. First and foremost, the greatest of all my appreciation goes to the almighty Allah for granting me countless blessings and helping me stay strong and focused to accomplish this work.

I would like to extend my sincere thanks to my advisor Can Alkan, who introduced me to the realm of bioinformatics. I feel very privileged to be part of his lab and I am proud that I am going to be his first PhD graduate. Can generously provided the resources and the opportunities that enabled me to publish and collaborate with researchers from other institutions. I appreciate all his support and guidance at key moments in my studies while allowing me to work independently on my own ideas. I truly enjoyed the memorable time we spent together in Ankara, Izmir, Istanbul, Vienna, and Los Angeles. Thanks to him, I got addicted to dark chocolate with almonds and sea salt.

I am also very thankful to my co-advisor Onur Mutlu for pushing my academic writing to a new level. It is an honor that he also gave me the opportunity to join his lab at ETH Zürich as a postdoctoral researcher. His dedication to research is contagious and inspirational.

I am grateful to the members of my thesis committee: Ergin Atalar, Mehmet Koyutürk, Nurcan Tunçbağ, Oğuz Ergin, and Özcan Öztürk for their valuable comments and discussions.

I would also like to thank Eleazar Eskin and Jason Cong for giving me the opportunity to join UCLA during the summer of 2017 as a staff research associate. I would like to extend my thanks to Serghei Mangul, David Koslicki, Farhad Hormozdiari, and Nathan LaPierre, who provided the guidance to make my visit fruitful and enjoyable.
I am also thankful to Akash Kumar for giving me the chance to join the Chair for Processor Design during the winter of 2017-2018 as a visiting researcher at TU Dresden.

I would like to acknowledge Nordin Zakaria and Izzatdin B A Aziz for providing me the opportunity to continue carrying out PETRONAS projects while I am in Turkey. I am grateful to Nor Hisham Hamid for believing in me, hiring me as a consultant for one of his government-sponsored projects, and inviting me to Malaysia. It was my great fortune to work with them.

I want to thank all my co-authors and colleagues: Eleazar Eskin, Erman Ayday, Hongyi Xin, Hasan Hassan, Serghei Mangul, Oğuz Ergin, Akash Kumar, Nour Almadhoun, Jeremie Kim, and Azita Nouri.

I would like to acknowledge all past and current members of our research group for being both great friends and colleagues to me. Special thanks to Fatma Kahveci for being a great labmate who taught me her native language and tolerated me for many years. Thanks to Gülfem Demir for all her valuable support and encouragement. Thanks to Shatlyk Ashyralyyev for his great help when I prepared for my qualifying examination and for many valuable discussions on research and life. Thanks to Handan Kulan for her friendly nature and priceless support. Thanks to Can Firtina and Damla Şenol Cali (CMU) for all the help in translating the abstract of this thesis to Turkish. I am grateful to other members for their companionship: Marzieh Eslami Rasekh, Azita Nouri, Elif Dal, Zülal Bingöl, Ezgi Ebren, Arda Söylev, Halil İbrahim Özercan, Fatih Karaoğlanoğlu, Tuğba Doğan, and Balanur İçen.

I would like to express my deepest gratitude to Ahmed Nemer Almadhoun, Khaldoun Elbatsh (and his wonderful family), Hamzeh Ahangari, Soe Thane, Mohamed Meselhy Eltoukhy (and his kind family), Abdel-Haleem Abdel-Aty, Heba Kadry, and Ibrahima Faye for all the good times and endless generous support in the ups and downs of life.
I feel very grateful to all my inspiring friends who made Bilkent a happy home for me, especially Tamer Alloh, Basil Heriz, Kadir Akbudak, Amirah Ahmed, Zahra Ghanem, Muaz Draz, Nabeel Abu Baker, Maha Sharei, Salman Dar, Elif Doğan Dar, Ahmed Ouf, Shady Zahed, Mohammed Tareq, Abdelrahman Teskyeh, Obada, and many others.

My graduate study has been an enjoyable journey with my lovely wife, Nour Almadhoun, and my sons, Hasan and Adam. Their love and never-ending support always give me the internal strength to move on whenever I feel like giving up or whenever things become too hard. Pursuing a PhD together with Nour in the same department is our most precious memory in a lifetime. We could not have completed any of our PhD work without continuously supporting and encouraging each other.

My family has been a pillar of support, encouragement, and comfort all throughout my journey. Thanks to my parents, Hasan Alser and Itaf Abdelhadi, who raised me with a love of science and supported me in all my pursuits. Thanks to my lovely sister Deema and my brothers Ayman (and his lovely family), Hani, Sohaib, Osaid, Moath, and Ayham for always believing in me and being by my side. Thanks to my parents-in-law, Mohammed Almadhoun and Nariman Almadhoun, my brothers-in-law, Ahmed Nemer, Ezz Eldeen, Moatasem, and Mahmoud, my sisters-in-law, Narmeen, Rania, and Eman, and their families for all their understanding, great support, and love.

Finally, I gratefully acknowledge the funding sources that made my PhD work possible. I was honored to be a TÜBİTAK Fellow for 4 years, offered by the Scientific and Technological Research Council of Turkey under the 2215 program. In 2017, I was also honored to receive the HiPEAC Collaboration Grant. I am grateful to Bilkent University for providing generous financial support and funding my academic visits. This thesis was supported by NIH Grant (No. HG006004) to Onur Mutlu and Can Alkan and a Marie Curie Career Integration Grant (No. PCIG-2011-303772) to Can Alkan under the Seventh Framework Programme.
Mohammed H. K. Alser
June 2018, Ankara, Turkey

Contents
3.3 Discussion on Improving the Filtering Accuracy 34
3.4 Summary 35
7.2.1 Method 1: Building the Neighborhood Map 66
7.2.2 Method 2: Identifying the E+1 Identical Subsequences 66
7.2.3 Method 3: Filtering Out Dissimilar Sequences 68
7.3 Analysis of MAGNET Algorithm 69
7.4 Discussion 71
7.5 Summary 72
10 Conclusions and Future Directions 118
11 Other Works of This Author 122
A Data 141

List of Figures
2.1 Timeline of read mappers. CPU-based mappers are plotted in black, GPU-accelerated mappers in red, FPGA-accelerated mappers in blue, and SSE-based mappers in green. Grey dotted lines connect related mappers (extensions or new versions). The names in the timeline are exactly as they appear in publications, except: SOAP3-FPGA [103], BWA-MEM-FPGA [55], BFAST-Olson [104], BFAST-Yus [105], BWA-Waidyasooriya [58], and BWA-W [68]. 18

3.1 An example of a mapping with all its generated masks, where the edit distance threshold (E) is set to 3. The green highlighted subsequences are part of the correct alignment. The red highlighted bit in the final bit-vector is a wrong alignment provided by SHD. The correct alignment (highlighted in yellow) shows that there are three substitutions and a single deletion, while SHD detects only two substitutions and a single deletion. 29

3.2 Examples of an incorrect mapping that passes the SHD filter due to the random zeros. While the edit distance threshold is 3, a mapping of 4 edits (as examined at the end of the figure by the Needleman-Wunsch algorithm) passes as a falsely accepted mapping. 30

3.3 An example of an incorrect mapping that passes the SHD filter due to conservative counting of the short streak of '1's in the final bit-vector. 31

3.4 Examples of an invalid mapping that passes the SHD filter due to the leading and trailing zeros. We use an edit distance threshold of 3 and an SRS threshold of 2. While the regions that are highlighted in green are part of the correct alignment, the wrong alignment provided by SHD is highlighted in red. The yellow highlighted bits indicate a source of falsely accepted mapping. 32

3.5 Examples of incorrect mappings that pass the SHD filter due to (a) overlapping identical subsequences, and (b) lack of backtracking. 33

4.1 Overview of our accelerator architecture. 38

4.2 Xilinx Virtex-7 FPGA VC709 Connectivity Kit and chip layout of our implemented accelerator. 39

5.1 A flowchart representation of the GateKeeper algorithm. 44

5.2 An example showing how various types of edits affect the alignment of two reads. In (a) the upper read exactly matches the lower read and thus each base exactly matches the corresponding base in the target read. (b) shows base-substitutions that only affect the alignment at their positions. (c) and (d) demonstrate insertions and deletions, respectively. Each edit has an influence on the alignment of all the subsequent bases. 45

5.3 Workflow of the proposed architecture for the parallel amendment operations. 46

5.4 An example of applying our solution for reducing the number of bits of each Hamming mask by half. We use a modified Hamming mask to store the result of applying the bitwise OR operation to each two bits of the Hamming mask. The modified mask maintains the same meaning of the original Hamming mask. 48

6.1 Random edit distribution in a read sequence. The edits (e1, e2, ..., eE) act as dividers resulting in several identical subsequences (m1, m2, ..., mE+1) between the read and the reference. 53

6.2 Neighborhood map (N) and the Shouji bit-vector, for text T = GGTGCAGAGCTC and pattern P = GGTGAGAGTTGT for E = 3. The three common subsequences (i.e., GGTG, AGAG, and T) are highlighted in yellow. We use a search window of size 4 columns (two examples of which are highlighted in red) with a step size of a single column. Shouji searches diagonally within each search window for the 4-bit vector that has the largest number of zeros. Once found, Shouji examines if the found 4-bit vector maximizes the number of zeros at the corresponding location of the 4-bit vector in the Shouji bit-vector. If so, then Shouji stores this 4-bit vector in the Shouji bit-vector at its corresponding location. 55

6.3 The effect of the window size on the rate of the falsely accepted sequences (i.e., dissimilar sequences that are considered as similar ones by the Shouji filter). We observe that a window width of 4 columns provides the highest accuracy. We also observe that as the window size increases beyond 4 columns, more similar sequences are rejected by Shouji, which should be avoided. 58

6.4 An example of applying the Shouji filtering algorithm to a sequence pair, where the edit distance threshold is set to 4. We present the content of the neighborhood map along with the Shouji bit-vector. We apply the Shouji algorithm starting from the leftmost column towards the rightmost column. 59

6.5 A block diagram of the sliding window scheme implemented in FPGA for a single filtering unit. 62

7.1 The false accept rate of MAGNET using two different algorithms for identifying the identical subsequences. We observe that finding the E+1 non-overlapping identical subsequences leads to a significant reduction in the incorrectly accepted sequences compared to finding the top E+1 longest identical subsequences. 68

7.2 An example of applying the MAGNET filtering algorithm with an edit distance threshold of 4. MAGNET finds all the longest non-overlapping subsequences of consecutive zeros in the descending order of their length (as numbered in yellow). 69

8.1 Weighted neighborhood maze (WN), for reference sequence A = GGTGGAGAGATC and read sequence B = GGTGAGAGTTGT for E = 3. The snake traverses the path that is highlighted in yellow, which includes three obstacles and three pipes that represent three identical subsequences: GGTG, AGAG, and T. 78

8.2 Unweighted neighborhood maze (UN), for reference sequence A = GGTGGAGAGATC and read sequence B = GGTGAGAGTTGT for E = 3. The snake traverses the path that is highlighted in yellow, which includes three obstacles and three pipes that represent three identical subsequences: GGTG, AGAG, and T. 80

8.3 Proposed 4-bit LZC comparator. 85

8.4 Block diagram of the 11 LZCs and the hierarchical LZC comparator tree for computing the largest number of leading zeros in 11 rows. 86

9.1 The effects of column-wise partitioning of the search space of the SneakySnake algorithm on the average false accept rate ((a), (c), and (e)) and the average execution time ((b), (d), and (f)) of examining set 1 to set 4 in (a) and (b), set 5 to set 8 in (c) and (d), and set 9 to set 12 in (e) and (f). Besides the default size (equal to the read length) of SneakySnake's unweighted neighborhood maze, we choose partition sizes (the number of grid columns that are included in each partition) of 5, 10, 25, and 50 columns. 96

9.2 The effects of the number of replications of the hardware implementation of the SneakySnake algorithm on its filtering accuracy (false accept rate). We use a wide range of edit distance thresholds (0 bp-10 bp for a read length of 100 bp) and four datasets: (a) set 1, (b) set 2, (c) set 3, and (d) set 4. 97

9.3 The effects of computing the prefix edit distance on the overall execution time of the edit distance calculations compared to the original Edlib (exact edit distance) and our partitioned implementation of the SneakySnake algorithm. We present the average time spent in examining set 1 to set 4 in (a), set 5 to set 8 in (b), and set 9 to set 12 in (c). We choose initial sub-matrix sizes of 5, 10, 25, and 50 columns. We mark the intersection of the SneakySnake-5 and Edlib-50 plots with a dashed vertical line. 99

9.4 The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-10 edits for a read length of 100 bp) and four datasets: (a) set 1, (b) set 2, (c) set 3, and (d) set 4. 101
9.5 The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-15 edits for a read length of 150 bp) and four datasets: (a) set 5, (b) set 6, (c) set 7, and (d) set 8. 102

9.6 The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-25 edits for a read length of 250 bp) and four datasets: (a) set 9, (b) set 10, (c) set 11, and (d) set 12. 103

9.7 An example of a falsely-rejected mapping using the MAGNET algorithm for an edit distance threshold of 6. The random zeros (highlighted in red) confuse the MAGNET filter, causing it to select shorter segments of random zeros instead of a longer identical subsequence (highlighted in blue). 104

9.8 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm. 110

9.9 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm. 111

9.10 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm. 112
9.11 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds. 113

9.12 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds. 114

9.13 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds. 115

9.14 Memory utilization of the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25. 116

9.15 Memory utilization of the exact edit distance algorithm (Edlib ED) combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25. 116

9.16 Memory utilization of SneakySnake-5 combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25. 117

List of Tables

(-e parameter). 91

9.3 FPGA resource usage for a single filtering unit of GateKeeper, Shouji, MAGNET, and SneakySnake for a sequence length of 100 and under different edit distance thresholds (E). 93

9.4 The execution time (in seconds) of GateKeeper, MAGNET, Shouji, and SneakySnake under different edit distance thresholds. We use set 1 to set 4 with a read length of 100. We provide the performance results for the CPU implementations and the hardware accelerators with the maximum number of filtering units. 106

9.5 End-to-end execution time (in seconds) for several state-of-the-art sequence alignment algorithms, with and without pre-alignment filters (SneakySnake, Shouji, MAGNET, GateKeeper, and SHD) and across different edit distance thresholds. We use four datasets (set 1, set 2, set 3, and set 4). 108

A.1 Details of our first four datasets (set 1, set 2, set 3, and set 4). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 10 edits. 142

A.2 Details of our second four datasets (set 5, set 6, set 7, and set 8). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 15 edits. 143

A.3 Details of our last four datasets (set 9, set 10, set 11, and set 12). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 25 edits. 143

A.4 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using set 1, set 2, set 3, and set 4 datasets. 144

A.5 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using set 5, set 6, set 7, and set 8 datasets. 144

A.6 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using set 9, set 10, set 11, and set 12 datasets. 145

Chapter 1

Introduction
The genome is the code of life: it includes the set of instructions for making everything from humans to elephants, bananas, and yeast. Analyzing life's code helps, for example, to determine differences in genomes from human to human that are passed from one generation to the next and may cause diseases or different traits [1, 2, 3, 4, 5, 6]. One benefit of knowing the genetic variations is a better understanding and diagnosis of diseases such as cancer and autism [7, 8, 9, 10] and the development of efficient drugs [11, 12, 13, 14, 15]. The first step in genome analysis is to reveal the entire content of the subject genome, a process known as DNA sequencing [16]. To this day, it remains challenging to sequence the entire DNA molecule as a whole. As a workaround, high throughput DNA sequencing (HTS) technologies are used to sequence random fragments of copies of the original molecule. These fragments are called short reads and are 75-300 basepairs (bp) long. The resulting reads lack information about their order and origin (i.e., which part of the subject genome they originated from). Hence the main challenge in genome analysis is to construct the donor's complete genome with respect to a reference genome.

During a process called read mapping, each read is mapped onto one or more possible locations in the reference genome based on the similarity between the read and the reference sequence segment at that location (like solving a jigsaw puzzle). The similarity measurement is referred to as optimal read alignment (i.e., verification) and can be calculated using the Smith-Waterman local alignment algorithm [17]. It calculates the alignment, which is an ordered list of characters representing the possible edit operations and matches required to change one of the two given sequences into the other. Commonly allowed edit operations include deletion, insertion, and substitution of characters in one or both sequences.
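To make the notion of an alignment as an ordered list of edit operations concrete, here is a minimal illustrative sketch (a plain Levenshtein-distance variant in Python, not the thesis's Smith-Waterman implementation; the function name `align` is our own): it fills the classic dynamic-programming table and backtracks to recover one optimal sequence of matches, substitutions, insertions, and deletions.

```python
def align(a, b):
    """Levenshtein DP with traceback: returns (edit distance, edit script).

    The edit script is an ordered list of operations ('match', 'substitute',
    'insert', 'delete') that transforms string `a` into string `b`.
    """
    m, n = len(a), len(b)
    # D[i][j] = edit distance between the prefixes a[:i] and b[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,  # match or substitute
                          D[i - 1][j] + 1,         # delete from a
                          D[i][j - 1] + 1)         # insert into a
    # Backtrack from D[m][n] to recover one optimal alignment.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append('match' if a[i - 1] == b[j - 1] else 'substitute')
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append('delete')
            i -= 1
        else:
            ops.append('insert')
            j -= 1
    return D[m][n], ops[::-1]
```

For example, `align("GATTACA", "GATCACA")` reports a distance of 1 with a single substitution in the edit script. Both the table fill and the traceback touch O(mn) cells, which is exactly the quadratic cost discussed below.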
Asany two sequences can have several different arrangements of the edit operationsand matches (and hence different alignments), the alignment algorithm usuallyinvolves a backtracking step. This step finds the alignment that has the highestalignment score (called optimal alignment). The alignment score is the sumof the scores of all edits and matches along the alignment implied by a user-defined scoring function. However, this approach is infeasible as it requires
O(mn) running time, where m is the read length (a few hundred bp) and n is the reference length (about 3 billion bp for the human genome). Mappers therefore typically follow a strategy called seed-and-extend. In this strategy, a mapper applies heuristics to first find candidate map locations (seed locations) of subsequences of the reads using hash tables (BitMapper [18], mrFAST with FastHASH [19], mrsFAST [20]) or BWT-FM indices (BWA-MEM [21], Bowtie 2 [22], SOAP3-dp [23]). It then aligns the read in full only to those seed locations. Although the strategies for finding seed locations vary among different read mapping algorithms, seed location identification is typically followed by an alignment step. The general goal of this step is to compare the read to the reference segment at the seed location to check if the read aligns to that location in the genome with fewer differences (called edits) than a threshold [24].

Figure 1.1: Rate of verified and unverified read-reference pairs that are generated by the mrFAST mapper and are fed into its read alignment algorithm. We set the threshold to 2, 3, and 5 edits.

The alignment step is the performance bottleneck of today's read mappers, taking over 70% to 90% of the total running time [25, 18, 19]. We pinpoint three specific problems that cause, affect, or exacerbate the long execution time of alignment.

(1) We find that across our data set (see
Chapter 9), an overwhelming majority (more than 90%, as we present in Figure 1.1) of the seed locations that are generated by a state-of-the-art read mapper, mrFAST with FastHASH [19], exhibit more edits than the allowed threshold. These particular seed locations impose a large computational burden, as verifying these incorrect mappings wastes 90% of the alignment step's execution time. This observation is also in line with similar results for other read mappers [19, 25, 18, 26].

(2) A typical read alignment algorithm also needs to tolerate sequencing errors [27, 28] as well as genetic variations [29]. Read alignment is a non-additive measure [30]. This means that if we divide a sequence pair into two consecutive subsequence pairs, the edit distance of the entire sequence pair is not necessarily equivalent to the sum of the edit distances of the shorter pairs. Instead, we need to examine all possible prefixes of the two input sequences and keep track of the pairs of prefixes that provide an optimal solution. Therefore, the read alignment step is implemented using dynamic programming algorithms to avoid re-examining the same prefixes many times. These include Levenshtein distance [31], Smith-Waterman [17], Needleman-Wunsch [32], and their improved implementations. These algorithms are inefficient as they run in quadratic time in the read length m (i.e., O(m²)).

(3) This computational burden is further aggravated by the unprecedented flood of sequencing data, which continues to overwhelm the processing capacity of existing algorithms and compute infrastructures [33]. While today's HTS machines (e.g., Illumina HiSeq4000) can generate more than 300 million bases per minute, a state-of-the-art read mapper can only map 1% of these bases per minute [34].
The situation gets even worse when one tries to understand a complex disease (e.g., autism and cancer) [8, 9, 35, 36, 37, 38, 39] or profile a metagenomics sample [40, 41, 42, 43], which requires sequencing hundreds of thousands of genomes. The long execution time of modern-day read alignment can severely hinder such studies. There is also an urgent need for rapidly incorporating clinical sequencing into clinical practice for the diagnosis of genetic disorders in critically ill infants [44, 45, 46, 47]. While early diagnosis in such infants shortens the clinical course and enables optimal outcomes [48, 49, 50], it is still challenging to deliver efficient clinical sequencing for the tens to hundreds of thousands of hospitalized infants each year [51].

Tackling these challenges and bridging the widening gap between the execution time of read alignment and the increasing amount of sequencing data necessitate the development of fundamentally new, fast, and efficient read alignment algorithms. In the next sections, we provide the motivation behind our proposed work to considerably boost the performance of read alignment. We also provide further background information and a literature study in Chapter 2.

1.2 Motivation

We present in Figure 1.2 the flow chart of a typical seed-and-extend based mapper during the mapping stage. The mapper follows five main steps to map a read set to the reference genome sequence. (1) In step 1, a mapper typically first constructs a fast indexing data structure (e.g., a large hash table) for short segments (called seeds, k-mers, or q-grams) of the reference sequence. (2) In step 2, the mapper extracts short subsequences from a read and uses them to query the hash table. (3) In step 3, the hash table returns all the occurrence hits of each seed in the reference genome. Modern mappers employ a seed filtering mechanism to reduce the false seed locations that lead to incorrect mappings.
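The indexing and querying in steps 1-3 can be sketched as follows. This is a toy Python illustration with hypothetical function names; real mappers use far more compact index structures than a plain dictionary.

```python
from collections import defaultdict

def build_seed_index(reference, k):
    """Step 1: map every length-k seed of the reference to its locations."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_locations(read, index, k):
    """Steps 2-3: query non-overlapping seeds of the read; each exact hit
    suggests one candidate location for the entire read in the reference."""
    locations = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            if pos - offset >= 0:      # align the read start to the hit
                locations.add(pos - offset)
    return sorted(locations)
```

Each candidate location returned here is what steps 4 and 5 must subsequently examine, which is why pruning this list early is so valuable.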
(4) In step 4, for each possible location in the list, the mapper retrieves the corresponding reference segment from the reference genome based on the seed location. The mapper can then examine the alignment of the entire read with the reference segment using fast filtering heuristics that reduce the need for the dynamic programming algorithms. It rejects the mapping if the read and the reference are obviously dissimilar. Otherwise, the mapper proceeds to the next step. (5) In step 5, the mapper calculates the optimal alignment between the read sequence and the reference sequence using a computationally costly sequence alignment (i.e., verification) algorithm to determine the similarity between the two sequences. Many attempts were made to tackle the computationally very expensive alignment problem. Most existing works tend to follow one of three key directions: (1) accelerating the dynamic programming algorithms [52, 53, 54, 55, 56, 57, 58, 23, 59, 60], (2) developing seed filters that aim to reduce the false seed locations [18, 19, 26, 61, 62, 63, 64], and (3) developing pre-alignment filtering heuristics [65].

The first direction takes advantage of the parallelism capabilities of high performance computing platforms such as central processing units (CPUs) [54, 56], graphics processing units (GPUs) [57, 23, 59], and field-programmable gate arrays (FPGAs) [52, 53, 66, 67, 55, 58, 60]. Among these computing platforms, FPGA accelerators seem to yield the highest performance gain [53, 68]. However, many of these efforts either simplify the scoring function or only accelerate the computation of the dynamic programming matrix without providing the optimal alignment (i.e., backtracking), as in [66, 67]. Different scoring functions are typically needed to better quantify the similarity between the read and the reference sequence segment [69].
The backtracking step required for optimal alignment computation involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation.

The second direction to accelerate read alignment is to use filtering heuristics to reduce the size of the seed location list. This is the basic principle of nearly all mappers that employ the seed-and-extend approach. A seed filter applies heuristics to reduce the output location list, which stores all the occurrence locations of each seed in the reference genome. The returned location list can be tremendously large, as a mapper searches for exact matches of short segments (typically 10 bp to 13 bp for hash-based mappers) between two very long homologous genomes [70]. Filters in this category suffer from low filtering accuracy as they can only look for exact matches with the help of a hash table. Thus, they query a small number of seeds per read (e.g., in Bowtie 2 [22], three fixed-length seeds at fixed locations) to maintain edit tolerance. mrFAST [19] uses another approach to increase the seed filtering accuracy by querying the seed and its shifted copies. This idea is based on the observation that indels cause the trailing characters to be shifted in one direction. If one of the shifted copies of the seed, generated from the read sequence, or the seed itself matches the corresponding seed from the reference, then this seed has zero edits. Otherwise, this approach counts the number of edits in this seed as a single edit (which can be a single indel or a single substitution). Therefore, this approach fails to detect the correct number of edits in some cases, for example, more than one substitution in the same seed, a substitution and an indel in the same seed, or more than one indel in the last seed.
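The shifted-seed idea described above can be sketched as follows. The helper name and interface are hypothetical (this is not the mrFAST source); the sketch only illustrates why the estimate saturates at one edit per seed.

```python
def seed_edit_estimate(read_seed_region, ref_seed, max_shift=1):
    """Sketch of the shifted-seed heuristic: if the seed, or one of its
    copies shifted by up to max_shift characters, matches the reference
    seed exactly, estimate zero edits; otherwise estimate a single edit.
    read_seed_region must include max_shift flanking characters on each
    side of the seed's expected position."""
    k = len(ref_seed)
    for shift in range(-max_shift, max_shift + 1):
        start = max_shift + shift
        if read_seed_region[start:start + k] == ref_seed:
            return 0       # exact match at some shift: assume no edits
    return 1               # otherwise count the whole seed as one edit
```

A seed containing two substitutions still returns 1 here, which is exactly the undercounting weakness discussed above.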
Seed filtering succeeds in eliminating some of the incorrect locations, but it is still unable to eliminate a sufficiently large portion of the false seed locations, as we present in Figure 1.1.
Figure 1.2: The flow chart of seed-and-extend based mappers, which includes five main steps. (1) A typical mapper starts with indexing the reference genome using a scheme based on a hash table or BWT-FM. (2) It extracts a number of seeds from each read produced by the sequencing machine. (3) It obtains a list of possible locations within the reference genome that could result in a match with the extracted seeds. (4) It applies fast heuristics to examine the alignment of the entire read with its corresponding reference segment of the same length. (5) It uses an expensive and accurate alignment algorithm to determine the similarity between the read and the reference sequences.
Figure 1.3: End-to-end execution time (in seconds) for read alignment, with (blue plot) and without (green plot) a pre-alignment filter. Our goal is to significantly reduce the alignment time spent on verifying incorrect mappings (highlighted in yellow). We sweep the percentage of rejected mappings and the filtering speed compared to the alignment algorithm on the x-axis.

The third direction to accelerate read alignment is to minimize the number of incorrect mappings on which alignment is performed by incorporating filtering heuristics. This is the last line of defense before invoking computationally expensive read alignment. Such filters come into play before read alignment (hence the name pre-alignment filter), discarding incorrect mappings that alignment would deem a poor match. Though seed filtering and pre-alignment filtering have the same goal, they are fundamentally different problems. In the pre-alignment filtering approach, a filter needs to examine the entire mapping. It calculates a best-guess lower bound on the number of edits between a read sequence and a reference segment. If the lower bound exceeds a certain number of edits, indicating that the read and the reference segment do not align, the mapping is eliminated so that no alignment is performed. Unfortunately, the best performing existing pre-alignment filter, shifted Hamming distance (SHD), is slow, and its mechanism unnecessarily introduces inaccuracy into its filtering, as we show in our study in Chapter 3 and in our experimental evaluation in Chapter 9.

A pre-alignment filter enables the acceleration of read alignment and, at the same time, offers the ability to make the best use of existing read alignment algorithms. These benefits come without sacrificing any capabilities of these algorithms, as a pre-alignment filter does not modify or replace the alignment step. This motivates us to focus our improvement and acceleration efforts on pre-alignment filtering.
We analyze in Figure 1.3 the effect of adding a pre-alignment filtering step after generating the seed locations and before calculating the optimal alignment. We make two key observations. (1) The reduction in the end-to-end processing time of the alignment step largely depends on the accuracy and the speed of the pre-alignment filter. (2) Pre-alignment filtering can provide unsatisfactory performance (as highlighted in red) if it cannot reject more than about 30% of the potential mappings while it is only 2x-4x faster than the read alignment step.

We conclude that it is important to understand what makes a pre-alignment filter inefficient, so that we can devise a new filtering technique that is much faster than read alignment and yet maintains high filtering accuracy.
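Conceptually, a pre-alignment filter slots into the mapper as a cheap lower-bound check before the dynamic programming step. The following sketch (with hypothetical function names, not the thesis's code) shows this control flow; the filter and aligner are passed in as callables.

```python
def map_read(read, ref_segments, threshold, prealign_filter, align):
    """Run the expensive aligner only on mappings the filter accepts.
    prealign_filter returns a quick lower-bound estimate of the edit
    distance; mappings whose estimate exceeds the threshold are rejected
    without ever invoking the expensive alignment algorithm."""
    results = []
    for segment in ref_segments:
        if prealign_filter(read, segment) > threshold:
            continue                    # reject: cheap, no alignment done
        score = align(read, segment)    # expensive dynamic programming
        if score <= threshold:
            results.append((segment, score))
    return results
```

The end-to-end benefit then depends on exactly the two quantities swept in Figure 1.3: how often the filter rejects, and how much cheaper `prealign_filter` is than `align`.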
Our goal in this thesis is to significantly reduce the time spent on calculating the optimal alignment in genome analysis from hours to mere seconds, given limited computational resources (i.e., a personal computer or small hardware). This would make it feasible to analyze DNA routinely in the clinic for personalized health applications. Towards this end, we analyze the mappings that are provided to the read alignment algorithm and explore the causes of filtering inaccuracy. Our thesis statement is:

Read alignment can be substantially accelerated using computationally inexpensive and accurate pre-alignment filtering algorithms designed for specialized hardware.
An accurate filter designed on a specialized hardware platform can drastically expedite read alignment by reducing the number of locations that must be verified via dynamic programming. To this end, we (1) develop four hardware-acceleration-friendly filtering algorithms and highly-parallel hardware accelerator designs that greatly reduce the need for alignment verification in DNA read mapping, (2) introduce a fast and accurate pre-alignment filter for general-purpose processors, and (3) develop a better understanding of filtering inaccuracy and explore speed/accuracy trade-offs.
The overarching contribution of this thesis is the new algorithms and architectures that reduce read alignment's execution time in read mapping. More specifically, this thesis makes the following main contributions:

1. We provide a detailed investigation and analysis of four potential causes of filtering inaccuracy in the state-of-the-art alignment filter, SHD [65]. We also provide our recommendations for eliminating these causes and improving the overall filtering accuracy.

2.
GateKeeper.
We introduce GateKeeper, the first hardware pre-alignment filter, which substantially reduces the need for alignment verification in DNA read mapping. GateKeeper is highly parallel and relies heavily on bitwise operations such as look-up table, shift, XOR, and AND operations. GateKeeper can examine up to 16 mappings in parallel on a single FPGA chip, with a logic utilization of less than 1% for a single filtering unit. It provides two orders of magnitude speedup over the state-of-the-art pre-alignment filter, SHD. It also provides up to 13.9x speedup to state-of-the-art aligners. GateKeeper is published in Bioinformatics [71] and is also available on arXiv [72].

3.
Shouji.
We introduce Shouji, a highly accurate and parallel pre-alignment filter that uses a sliding window approach to quickly identify dissimilar sequences without the need for computationally expensive alignment algorithms. Shouji can examine up to 16 mappings in parallel on a single FPGA chip, with a logic utilization of up to 2% for a single filtering unit. Due to its high accuracy, it provides, on average, 1.2x to 1.4x more speedup to state-of-the-art read aligners than GateKeeper provides. Shouji is 2.9x to 155x more accurate than GateKeeper. Shouji is published in Bioinformatics [73] and is also available on arXiv [74].

4.
MAGNET.
We introduce MAGNET, a highly accurate pre-alignment filter that employs a greedy divide-and-conquer approach to identify all non-overlapping long matches between two sequences. MAGNET can examine 2 or 8 mappings in parallel, depending on the edit distance threshold, on a single FPGA chip, with a logic utilization of up to 37.8% for a single filtering unit. MAGNET is, on average, two to four orders of magnitude more accurate than both Shouji and GateKeeper. This comes at the expense of its filtering speed, as it is up to 8x slower than Shouji and GateKeeper. It still provides up to 16.6x speedup to state-of-the-art read aligners. MAGNET is published in IPSI [75], available on arXiv [76], and was presented at AACBB 2018 [77].

5.
SneakySnake.
We introduce SneakySnake, the fastest and most accurate pre-alignment filter. SneakySnake reduces the problem of finding the optimal alignment to finding a snake's optimal path (the one with the fewest obstacles) in time linear in the read length. We provide a cost-effective CPU implementation of our SneakySnake algorithm that accelerates the state-of-the-art read aligners Edlib [78] and Parasail [54] by up to 43x and 57.9x, respectively, without the need for hardware accelerators. We also provide a scalable hardware architecture and hardware design optimizations for the SneakySnake algorithm to further boost its speed. The hardware implementation of SneakySnake accelerates the existing state-of-the-art aligners by up to 21.5x when it is combined with the aligner. SneakySnake is up to one order, four orders, and five orders of magnitude more accurate than MAGNET, Shouji, and GateKeeper, respectively, while preserving all correct mappings. SneakySnake also reduces the memory footprint of the Edlib aligner by 50%. This work is yet to be published.

6. We provide a comprehensive analysis of the asymptotic run time and space complexity of our four pre-alignment filtering algorithms. We perform a detailed experimental evaluation of our proposed algorithms using 12 real datasets across three different read lengths (100 bp, 150 bp, and 250 bp) and edit distance thresholds of 0% to 10% of the read length. We explore different implementations of the edit distance problem in order to compare the performance of SneakySnake, which calculates an approximate edit distance, with that of an efficient implementation of exact edit distance. This also helps us develop a deep understanding of the trade-off between the accuracy and speed of pre-alignment filtering.

Overall, we show in this thesis that developing a hardware-based alignment filtering algorithm and architecture together is both feasible and effective, by building our hardware accelerator on a modern FPGA system.
We also demonstrate that our pre-alignment filters are more effective in boosting the overall performance of the alignment step than merely accelerating the dynamic programming algorithms by one to two orders of magnitude.

This thesis provides a foundation for developing fast and accurate pre-alignment filters for accelerating existing and future read mappers.
This thesis is organized into 11 chapters. Chapter 2 describes the necessary background on read mappers and related prior work on accelerating their computations. Chapter 3 explores the potential causes of filtering inaccuracy and provides recommendations for tackling them. Chapter 4 presents the architecture and implementation details of the hardware accelerator that we use to boost the speed of our proposed pre-alignment filters. Chapter 5 presents the GateKeeper algorithm and architecture. Chapter 6 presents the Shouji algorithm and its hardware architecture. Chapter 7 presents the MAGNET algorithm and its hardware architecture. Chapter 8 presents the SneakySnake algorithm. Chapter 9 presents the detailed experimental evaluation of all our proposed pre-alignment filters, along with a comprehensive comparison with the state-of-the-art existing works. Chapter 10 presents conclusions and future research directions that are enabled by this thesis. Finally, Appendix A extends Chapter 9 with more detailed information and additional experimental data/results.

Chapter 2

Background
In this chapter, we provide the necessary background on two key read mapping methods. We highlight the strengths and weaknesses of each method. We then provide an extensive literature review of prior, existing, and recent approaches for accelerating the operations of read mappers. We restrict the background material to approaches that reduce a read mapper's execution time.
With the presence of a reference genome, read mappers maintain a large index database. mrFAST (> version 2.5) [19] first builds a hash table to index fixed-length seeds (typically 10-13 bp) from the reference genome. It then applies a seed location filtering mechanism, called the Adjacency Filter, on the hash table to reduce the false seed locations. It divides each query read into smaller fixed-length seeds to query the hash table for their associated seed locations. Given an edit distance threshold, the Adjacency Filter requires N-E seeds to exactly match at adjacent locations, where N is the number of seeds and E is the edit distance threshold. Finally, mrFAST tries to extend the read at each of the seed locations by aligning the read to the reference fragment at the seed location via Levenshtein edit distance [31] with Ukkonen's banded algorithm [101]. One drawback of this seed filtering is that the presence of one or more substitutions in any seed is counted by the Adjacency Filter as a single mismatch. The effectiveness of the Adjacency Filter for substitutions and indels diminishes when E becomes larger than 3 edits.

A recent work [102] shows that by removing redundancies in the reference genome and also across the reads, seed-and-extend mappers can be faster than BWT-FM-based mappers. This space-efficient approach uses a similar idea to the one presented in [94]. A hybrid method that incorporates the advantages of each approach can also be utilized, such as BWA-MEM [21].

A majority of read mappers are developed for machines equipped with general-purpose central processing units (CPUs). As the gap between CPU computing speed and the very large amount of sequencing data widens, CPU-based mappers become less favorable due to their limitations in accessing data [52, 53, 54, 55, 56, 57, 58, 23, 59, 60]. To tackle this challenge, many attempts were made to accelerate the operations of read mapping.
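Based on the description above, the Adjacency Filter's pigeonhole test can be sketched as follows. This is a simplified Python illustration that checks substitutions only; the actual filter also examines shifted adjacent positions to tolerate indels, and the function names here are hypothetical.

```python
def adjacency_filter(read, reference, location, num_seeds, threshold):
    """Pigeonhole-style check in the spirit of the Adjacency Filter:
    with at most `threshold` (E) edits, at least N - E of the N seeds
    must still match exactly at their expected adjacent positions."""
    seed_len = len(read) // num_seeds
    matches = 0
    for i in range(num_seeds):
        start = i * seed_len
        seed = read[start:start + seed_len]
        ref_pos = location + start
        if reference[ref_pos:ref_pos + seed_len] == seed:
            matches += 1
    return matches >= num_seeds - threshold  # pass -> verify via alignment
```

As the text notes, any number of substitutions inside one seed destroys only that one seed's exact match, so the filter counts them as a single edit.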
We survey in Figure 2.1 the existing read mappers implemented on various acceleration platforms. FPGA-based read mappers often demonstrate one to two orders of magnitude speedup over their GPU-based counterparts [53, 68]. Most of the existing works use hardware platforms only to accelerate the dynamic programming algorithms (e.g., the Smith-Waterman algorithm [17]), as these algorithms contribute significantly to the overall running time of read mappers. Most existing works can be divided into three main approaches: (1) developing seed filtering mechanisms to reduce the seed location list, (2) accelerating the computationally expensive alignment algorithms using algorithmic development or hardware accelerators, and (3) developing pre-alignment filtering heuristics to reduce the number of incorrect mappings before they are examined by read alignment. We next describe each of these three acceleration efforts in detail.
The first approach to accelerate today's read mappers is to filter the seed location list before performing read alignment.
Figure 2.1: Timeline of read mappers. CPU-based mappers are plotted in black, GPU-accelerated mappers in red, FPGA-accelerated mappers in blue, and SSE-based mappers in green. Grey dotted lines connect related mappers (extensions or new versions). The names in the timeline are exactly as they appear in publications, except: SOAP3-FPGA [103], BWA-MEM-FPGA [55], BFAST-Olson [104], BFAST-Yus [105], BWA-Waidyasooriya [58], and BWA-W [68].

This is the basic principle of nearly all seed-and-extend mappers. Seed filtering is based on the observation that if two sequences are potentially similar, then they share a certain number of seeds. Seeds (sometimes called q-grams or k-mers) are short subsequences that are used as indices into the reference genome to reduce the search space and speed up the mapping process. Modern mappers extract short subsequences from each read and use them as keys to query the previously built large reference index database. The database returns a location list for each seed, which stores all the occurrence locations of that seed in the reference genome. The mapper then examines the optimal alignment between the read and the reference segment at each of these seed locations. The performance and accuracy of seed-and-extend mappers depend on how the seeds are selected in the first stage.
Mappers should select a large number of non-overlapping seeds while keeping each seed as infrequent as possible for full sensitivity [70, 106, 19]. There is also a significant advantage to selecting seeds of unequal lengths, as possible seeds of equal lengths can have drastically different frequencies. Finding the optimal set of seeds from read sequences is challenging and complex, primarily because the associated search space is large and grows exponentially as the number of seeds increases. There are other variants of seed filtering based on the pigeonhole principle [18, 61], non-overlapping seeds [19], gapped seeds [107, 63], variable-length seeds [70], random permutation of subsequences [108], or full permutation of all possible subsequences [25, 109, 110].
The second approach to boost the performance of read mappers is to accelerate the read alignment step. One of the most fundamental computational steps in most bioinformatics analyses is the detection of the differences/similarities between two genomic sequences. Edit distance and pairwise alignment are two approaches to achieve this step, formulated as approximate string matching [24]. The edit distance approach is a measure of how much the sequences differ. It calculates the minimum number of edits needed to convert one sequence into the other. The higher the distance, the more the sequences differ from one another. Commonly allowed edit operations include deletion, insertion, and substitution of characters in one or both sequences. Pairwise alignment is a way to identify regions of high similarity between sequences. Each application employs a different edit model (called a scoring function), which is then used to generate an alignment score. The latter is a measure of how much the sequences are alike. Any two sequences have a single edit distance value, but they can have several different alignments (i.e., ordered lists of possible edit operations and matches) with different alignment scores. Thus, alignment algorithms usually involve a backtracking step to provide the optimal alignment (i.e., the best arrangement of the possible edit operations and matches), which has the highest alignment score. Depending on the demand, pairwise alignment can be performed as global alignment, where two sequences of the same length are aligned end-to-end, or local alignment, where subsequences of the two given sequences are aligned. It can also be performed as semi-global alignment (called glocal), where the entirety of one sequence is aligned towards one of the ends of the other sequence.

The edit distance and pairwise alignment approaches are non-additive measures [30].
This means that if we divide a sequence pair into two consecutive subsequence pairs, the edit distance of the entire sequence pair is not necessarily equivalent to the sum of the edit distances of the shorter pairs. Instead, we need to examine all possible prefixes of the two input sequences and keep track of the pairs of prefixes that provide an optimal solution. Enumerating all possible prefixes is necessary for tolerating edits that result from both sequencing errors [28] and genetic variations [29]. Therefore, these measures are typically implemented as dynamic programming algorithms to avoid re-computing the edit distance of the prefixes many times. These implementations, such as Levenshtein distance [31], Smith-Waterman [17], and Needleman-Wunsch [32], are still inefficient as they have quadratic time and space complexity (i.e., O(m²) for a sequence length of m). Many attempts were made to boost the performance of existing sequence aligners. Despite more than three decades of attempts, the fastest known edit distance algorithm [111] has a running time of O(m²/log m) for sequences of length m, which is still nearly quadratic [112].

Therefore, more recent works tend to follow one of two key new directions to boost the performance of sequence alignment and edit distance implementations: (1) accelerating the dynamic programming algorithms using hardware accelerators, and (2) developing filtering heuristics that reduce the need for the dynamic programming algorithms, given an edit distance threshold. Hardware accelerators are becoming increasingly popular for speeding up the computationally expensive alignment and edit distance algorithms [113, 114, 115, 116]. Hardware accelerators include multi-core and SIMD (single instruction multiple data) capable central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs).
The classical dynamic programming algorithms are typically accelerated by computing only the necessary regions (i.e., diagonal vectors) of the dynamic programming matrix rather than the entire matrix, as proposed in Ukkonen's banded algorithm [101]. The number of diagonal bands required for computing the dynamic programming matrix is 2E+1, where E is a user-defined edit distance threshold. The banded algorithm is still beneficial even in its recent sequential implementations, as in Edlib [78]. The Edlib algorithm is implemented in C for standard CPUs and calculates the banded Levenshtein distance. Parasail [54] exploits both Ukkonen's banded algorithm and SIMD-capable CPUs to compute a banded alignment for a sequence pair with a user-defined scoring matrix and affine gap penalty. SIMD instructions offer significant parallelism to the matrix computation by executing the same vector operation on multiple operands at once.

The multi-core architecture of CPUs and GPUs provides the ability to compute the alignments of many sequence pairs independently and concurrently [56, 57]. GSWABE [57] exploits GPUs (Tesla K40) for a highly-parallel computation of global alignment with affine gap penalty. CUDASW++ 3.0 [59] exploits the SIMD capability of both CPUs and GPUs (GTX690) to accelerate the computation of the Smith-Waterman algorithm with affine gap penalty. CUDASW++ 3.0 provides only the optimal score, not the optimal alignment (i.e., no backtracking step).

Other designs, for instance FPGASW [53], exploit the very large number of hardware execution units in FPGAs (Xilinx VC707) to form a linear systolic array [117]. Each execution unit in the systolic array is responsible for computing the value of a single entry of the dynamic programming matrix. The systolic array computes a single vector of the matrix at a time. The data dependencies between the entries restrict the systolic array to computing the vectors sequentially (e.g., top-to-bottom, left-to-right, or in an anti-diagonal manner).
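The banded idea can be illustrated with the following Python sketch, which computes the Levenshtein distance over only the 2E+1 diagonals |i - j| <= E. It is a simplified sequential sketch in the spirit of Ukkonen's banded algorithm, not the Edlib or Parasail implementation.

```python
def banded_edit_distance(s, t, E):
    """Banded Levenshtein distance: compute only cells with |i - j| <= E.
    Returns the distance if it is at most E, otherwise None (filtered)."""
    m, n = len(s), len(t)
    if abs(m - n) > E:
        return None            # length difference alone exceeds the band
    INF = E + 1                # any value > E behaves as "out of band"
    prev = [min(j, INF) for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [INF] * (n + 1)
        if i <= E:
            curr[0] = i
        for j in range(max(1, i - E), min(n, i + E) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match/substitution
        prev = curr
    return prev[n] if prev[n] <= E else None
```

Only O(mE) cells are filled instead of O(mn), which is why the band remains beneficial even in purely sequential implementations.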
An FPGA acceleration platform can also provide further speedup to big-data computing frameworks, such as Apache Spark, for accelerating BWA-MEM [21]. With this integration, Chen et al. [118] achieve a 2.6x speedup over the same cloud-based implementation without FPGA acceleration [119]. FPGA accelerators seem to yield the highest performance gain compared to the other hardware accelerators [52, 53, 68, 58]. However, many of these efforts either simplify the scoring function, or only take into account accelerating the computation of the dynamic programming matrix without providing the optimal alignment, as in [66, 67, 59]. Different scoring functions are typically needed to better quantify the similarity between two sequences [120, 121]. The backtracking step required for the optimal alignment computation involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation. Comprehensive surveys on hardware acceleration for computational genomics appear in [113, 114, 115, 116, 33].

The third approach to accelerate read mapping is to incorporate a pre-alignment filtering technique within the read mapper, before the read alignment step. This filter is responsible for quickly excluding incorrect mappings at an early stage (i.e., as a pre-alignment step) to reduce the number of false mappings (i.e., mappings that have more edits than the user-defined threshold) that must be verified via dynamic programming. Existing filtering techniques include the so-called shifted Hamming distance (SHD) [65], which we explain next.

2.2.3.1 Shifted Hamming Distance (SHD)
SHD enables pre-alignment filtering in the presence of indels and substitutions. Instead of building a single bit-vector using a pairwise comparison as Hamming distance does, SHD builds 2E+1 bit-vectors, where E is the user-defined edit distance threshold. This is similar to Ukkonen's banded algorithm [101]. Each bit-vector is built by gradually shifting the read sequence and then performing a pairwise comparison. The shifting process is necessary in order to skip the deleted (or inserted) character and examine the subsequent matches. SHD merges all masks using a bitwise AND operation. Due to the use of the AND operation, a zero (i.e., pairwise match) at any position in the 2E+1 masks leads to a '0' in the resulting output of the AND operation at the same position. The last step is to count the positions that have a value other than '0'. SHD decides whether the mapping is correct based on whether the number of mismatches exceeds the edit distance threshold. SHD heavily relies on bitwise operations such as shift, XOR, and AND. This makes SHD suitable for bitwise hardware implementations (e.g., FPGAs and SIMD-enabled CPUs).

Our crucial observation is that SHD examines each mapping, throughout the filtering process, by performing expensive computations unnecessarily, as SHD uses the same amount of computation regardless of the type of edit. SHD is also implemented using Intel SSE, which limits the supported read length to only 128 bp (due to the SIMD register size). The filtering mechanism of SHD also introduces inaccuracy in its filtering decision, as we investigate and demonstrate in Chapter 3 and in our experimental evaluation, Chapter 9.

We survey in this chapter the existing key directions that aim at accelerating all or part of the operations of modern read mappers.
We analyze these attempts and provide the pros and cons of each direction. We present three main acceleration approaches: (1) seed filtering, (2) accelerating the dynamic programming algorithm, and (3) pre-alignment filtering. In Figure 1.1, we illustrate that the state-of-the-art mapper mrFAST with FastHASH [19] generates more than 90% of the potential mappings as incorrect ones, although it implements a seed filtering mechanism (Adjacency Filter) and a SIMD-accelerated banded Levenshtein edit distance algorithm. This demonstrates that the development of a fundamentally new, fast, and efficient pre-alignment filter is of utmost necessity. Note that there is still no work, to the best of our knowledge, on specialized hardware acceleration of pre-alignment filtering techniques.

Chapter 3
Understanding and Improving Pre-alignment Filtering Accuracy
In this chapter, we first provide the performance metrics used to evaluate pre-alignment filtering techniques. The essential performance metrics are filtering speed and filtering accuracy. We then study the causes of filtering inaccuracy of the state-of-the-art pre-alignment filter, SHD [65], aiming at eliminating them. We find four key causes and provide a detailed investigation along with examples of these inaccuracy sources. This is the first work to comprehensively assess the filtering inaccuracy of the SHD algorithm [65] and provide recommendations for desirable improvements.
An ideal pre-alignment filter should be both fast and accurate in rejecting incorrect mappings. Meanwhile, it should also preserve all correct mappings. An incorrect mapping is defined as a sequence pair that differs by more than the edit distance threshold. A correct mapping is defined as a sequence pair that has a number of edits less than or equal to the edit distance threshold.

Next, we describe the performance metrics that are necessary to evaluate the speed and accuracy of existing and future pre-alignment filtering algorithms.
The filtering speed is defined as the time spent by the pre-alignment filter in examining all the incoming mappings. We always want to increase the speed of the pre-alignment filter to compensate for the computational overhead introduced by its filtering technique.
The false accept rate (or false positive rate) is the ratio between the incorrect mappings that are falsely accepted by the filter and the incorrect mappings that are rejected by the optimal read alignment algorithm. That is, a mapping is considered a false positive if the pre-alignment filter accepts it but read alignment rejects it. We always want to minimize the false accept rate.
The true accept rate (or true positive rate) is the ratio between the correct mappings that are accepted by the filter and the correct mappings that are accepted by the optimal read alignment algorithm. The true accept rate should always equal 1.

3.1.2.3 False Reject Rate
The false reject rate (or false negative rate) is the ratio between the correct mappings that are rejected by the filter and the correct mappings that are accepted by the optimal read alignment algorithm. The false reject rate should always equal 0.
The true reject rate (or true negative rate) is the ratio between the incorrect mappings that are rejected by the filter and the incorrect mappings that are rejected by the optimal read alignment algorithm. We always want to maximize the true reject rate. In fact, the true reject rate is the complement of the false accept rate: the two rates sum to 1 over the incorrect mappings. (If all mappings are correct and accepted by the filter, both rates are undefined, as there are no incorrect mappings.)
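Given per-mapping accept/reject decisions from a filter and from the optimal aligner (the ground truth), the four rates defined above can be computed as follows (an illustrative sketch; the function name and interface are ours):

```python
def filtering_rates(filter_accepts, aligner_accepts):
    """Compute the four rates from per-mapping decisions of a pre-alignment
    filter and an optimal aligner (which serves as the ground truth)."""
    tp = fp = tn = fn = 0
    for f, a in zip(filter_accepts, aligner_accepts):
        if f and a:
            tp += 1   # correct mapping kept by the filter
        elif f and not a:
            fp += 1   # incorrect mapping falsely accepted
        elif not f and not a:
            tn += 1   # incorrect mapping correctly rejected
        else:
            fn += 1   # correct mapping falsely rejected
    pos = tp + fn     # mappings accepted by the aligner (correct mappings)
    neg = fp + tn     # mappings rejected by the aligner (incorrect mappings)
    return {
        "true_accept_rate":  tp / pos if pos else 0.0,  # should equal 1
        "false_reject_rate": fn / pos if pos else 0.0,  # should equal 0
        "false_accept_rate": fp / neg if neg else 0.0,  # to be minimized
        "true_reject_rate":  tn / neg if neg else 0.0,  # to be maximized
    }
```

Note that the true accept and false reject rates share the aligner-accepted mappings as their denominator, while the false accept and true reject rates share the aligner-rejected mappings.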
Can a very fast filter with a high false accept rate be better than a more accurate but slower filter? The answer to this question is not trivial because both speed and accuracy contribute to the overall speed of read alignment. The only way to answer this question is to evaluate the effect of such a filter on the overall speed of the read alignment step. Thus, we need to evaluate the end-to-end alignment speed. This includes integrating the pre-alignment filter with the read alignment step and evaluating the acceleration rate. Another crucial observation is that a pre-alignment filter applies a heuristic approach, which can be optimal for some alignment cases while it fails in other cases. Thus, a filter that performs best for a specific read set and edit distance threshold may not perform well for other read sets and edit distance thresholds. The user-defined edit distance threshold, E, is usually less than 5% of the read length [18, 65, 122, 62].

3.2 On the False Filtering of the SHD Algorithm

In this section, we investigate the potential causes of filtering inaccuracy that are introduced by the state-of-the-art filter, SHD [65] (we describe the algorithm in Chapter 2). We also provide examples that illustrate each of these causes. Adding an additional fast filtering heuristic before the verification step in a read mapper can be beneficial. But such a filter can easily be worthless if it allows a high false accept rate. Even though the incorrect mappings that pass SHD are discarded later by the read alignment step (as it has a zero false accept rate and a zero false reject rate), they can dramatically increase the execution time of the read mapper by causing a mapping to be examined twice unnecessarily, by both the filtering step and the read alignment step. Below, we describe four major sources of false positives that are introduced by the filtering strategy of SHD.
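To make the discussion of these sources concrete, here is a simplified sketch of SHD's mask-merging step (in Python, ignoring the SRS amendment and the conservative streak counting described later; the real implementation operates on packed bit-vectors with SSE instructions):

```python
def shd_filter(read: str, ref: str, E: int) -> bool:
    """Simplified sketch of SHD (no SRS, no streak counting): AND the
    2E+1 shifted Hamming masks and threshold the surviving mismatches.
    The read and reference are assumed to be equal-length windows."""
    n = len(read)
    final = [1] * n                      # AND identity: mismatch everywhere
    for shift in range(-E, E + 1):       # the 2E+1 shifted Hamming masks
        mask = []
        for i in range(n):
            j = i + shift
            if 0 <= j < n:
                mask.append(0 if read[j] == ref[i] else 1)  # XOR-style compare
            else:
                mask.append(0)           # vacant bits are zero-filled (match)
        final = [f & m for f, m in zip(final, mask)]
    # accept iff the number of surviving mismatches is within the threshold
    return sum(final) <= E
```

The zero-filled out-of-range positions in this sketch correspond to the vacant bits of the bitwise implementation, which we return to in the discussion of leading and trailing zeros.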
The first source of the false accept rate of SHD [65] is the random zeros that appear in the individual shifted Hamming masks. Although they result from a pairwise comparison between a shifted read and a reference segment, we refer to them as random zeros because they are sometimes meaningless and are not part of the correct alignment. SHD ANDs all shifted Hamming masks together with the idea that all '0's in the individual Hamming masks propagate to the final bit-vector, thereby preserving the information of individual matching subsequences. Due to the use of the AND operation, a zero at any position in the 2E+1 Hamming masks leads to a '0' in the resulting final bit-vector at the same position. Hence, even if some Hamming masks show a mismatch at that position, a zero in some other mask leads to a match ('0') at the same position. This tends to underestimate the actual number of edits and eventually causes some incorrect mappings to pass. To fix this issue, SHD proposes the so-called speculative removal of short matches (SRS) before ANDing the masks, which flips short streaks of '0's in each mask into '1's such that they do not mask out '1's in other Hamming masks.
Figure 3.1: An example of a mapping with all its generated masks, where the edit distance threshold (E) is set to 3. The green highlighted subsequences are part of the correct alignment. The red highlighted bit in the final bit-vector is a wrong alignment provided by SHD. The correct alignment (highlighted in yellow) shows that there are three substitutions and a single deletion, while SHD detects only two substitutions and a single deletion.

We illustrate this method in Figure 3.1. The number of zeros to be amended (the SRS threshold) is set by default to two. That is, bit streams such as 101 and 1001 are replaced with 111 and 1111, respectively. The 2E+1 masks contain other zeros that are part of the correct alignment. For example, Figure 3.1 shows a segment of consecutive matches in the one-step right-shifted mask. This segment indicates that a single deletion occurred in the read sequence. Unlike these meaningful zeros, random amended zeros can be anywhere in the masks except at the two ends of each mask. However, the length and the position of these zeros are unpredictable. They can have any length, which makes the SRS method ineffective at handling these random zeros. There is no clear theory behind the exact SRS threshold to be used to eliminate such zeros. SRS successfully reduces some of the falsely accepted mappings, but it also introduces its own source of falsely accepted mappings. Choosing a small SRS threshold helps, but does not provide any guarantee, to get rid of some of these random zeros. Choosing a larger SRS threshold can be risky, since, with such a large threshold, SHD might no longer be able to distinguish whether a streak of consecutive zeros is generated by random chance or is part of the correct alignment.
This results in SHD ignoring most of the exact matching subsequences and causes an all-'1' final bit-vector.
Figure 3.2: An example of an incorrect mapping that passes the SHD filter due to the random zeros. While the edit distance threshold is 3, a mapping of 4 edits (as verified at the end of the figure by the Needleman-Wunsch algorithm) passes as a falsely accepted mapping.

In Figure 3.2, we provide an example where random zeros dominate and lead to a zero in the final bit-vector at their corresponding locations. SRS can address the inaccuracy caused by the random 3-bit zeros, which are highlighted by the left arrow, using an SRS threshold of 3. However, SRS is still unable to solve the inaccuracy caused by the 15-bit zeros that are highlighted by the right arrow. This is due to the fact that the 15-bit zeros are part of the correct alignment, and hence amending them to ones can introduce more falsely accepted mappings.
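The SRS amendment itself is straightforward to sketch: streaks of at most t '0's that are flanked by '1's on both sides are flipped to '1's, while leading and trailing zeros are left untouched. This is an illustrative Python sketch of the rule, not the SSE implementation:

```python
import re

def srs_amend(mask: str, t: int = 2) -> str:
    """Speculative removal of short matches (SRS): flip streaks of up to t
    '0's that are flanked by '1's so that they cannot mask out mismatches
    in other Hamming masks when the masks are ANDed together."""
    # e.g. with t = 2: '101' -> '111', '1001' -> '1111', '10001' unchanged
    return re.sub(r'(?<=1)0{1,%d}(?=1)' % t,
                  lambda m: '1' * len(m.group()), mask)
```

As the examples above show, the rule is a heuristic: it cannot touch long streaks of random zeros, and flipping genuine short matches introduces its own source of errors.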
The second source of the high false accept rate of SHD [65] is related to the way in which SHD counts the edits in the final bit-vector. Amending short streaks of '0's to '1's could cause correct mappings to be mistakenly filtered out, as it may produce multiple ones in the final bit-vector. To ensure that it does not overcount the number of edits, SHD always assumes that a streak of '1's in the final bit-vector is a side effect of the SRS amendment, and counts only the minimum number of edits that could potentially generate such a streak of '1's. The total number of edits reported by SHD can thus be much smaller than the actual number of edits.
Figure 3.3: An example of an incorrect mapping that passes the SHD filter due to conservative counting of the short streaks of '1's in the final bit-vector.

For instance, as illustrated in Figure 3.3, three consecutive substitutions render a streak of three '1's in the final bit-vector. But since SHD always assumes the middle '1' is the result of a '0' amended by SRS, SHD will only consider the streak of three '1's as a single edit and let it pass, even if the edit distance threshold is less than three.
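The conservative counting rule can be sketched as follows: a streak of k consecutive '1's in the final bit-vector is charged only ceil(k / (t+1)) edits, where t is the SRS threshold, since up to t amended zeros may separate any two real edits. This is our illustrative reconstruction of the rule, not the exact SHD code:

```python
import math

def conservative_edit_count(final_bitvector: str, srs_threshold: int = 2) -> int:
    """Sketch of SHD's conservative counting: each streak of k consecutive
    '1's is attributed to the minimum number of edits that could produce
    it after SRS amendment, i.e. ceil(k / (srs_threshold + 1))."""
    edits = 0
    for streak in final_bitvector.split('0'):
        if streak:  # a maximal run of '1's
            edits += math.ceil(len(streak) / (srs_threshold + 1))
    return edits
```

With the default t = 2, the streak '111' produced by three real substitutions in Figure 3.3 is charged only ceil(3/3) = 1 edit, which is exactly the undercounting described above.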
The third source of the high false accept rate of SHD [65] is the streaks of zeros that are located at either of the two ends of each mask. Hence we refer to them as leading and trailing zeros. These streaks of zeros can appear in two forms: (1) the vacant bits that are caused by shifting the read against the reference segment, and (2) the streaks of zeros that are not vacant bits. SHD generates the 2E+1 masks using arithmetic left-shift and arithmetic right-shift operations. For the left and right directions, the right-most and the left-most vacant bits, respectively, are filled with '0's. The number of vacant zeros depends on the number of shifted steps for each mask, which is at most equal to the edit distance threshold. The second form of the leading and trailing zeros is the zeros that are located at the two ends of the Hamming masks and are not vacant zeros. These streaks of zeros result from the pairwise comparison (i.e., bitwise XOR). They differ from the vacant bits in that their length is independent of the edit distance threshold.
Figure 3.4: An example of an invalid mapping that passes the SHD filter due to the leading and trailing zeros. We use an edit distance threshold of 3 and an SRS threshold of 2. While the regions that are highlighted in green are part of the correct alignment, the wrong alignment provided by SHD is highlighted in red. The yellow highlighted bits indicate a source of falsely accepted mapping.

The main issue with both forms of leading and trailing zeros is that they always dominate, even if some Hamming masks show a mismatch at that position (due to the use of the AND operation). This gives the false impression that the read and the reference have a smaller edit distance, even when they differ significantly, as explained in Figure 3.4. SRS cannot address the inaccuracy caused by the leading and trailing zeros (by amending such zeros to ones) for two reasons: (1) the number of these consecutive zeros is not fixed, and thus they can be longer than the SRS threshold; (2) these consecutive zeros are not surrounded by ones, and hence even if the SRS threshold is greater than two bits, they are not eligible to be amended.
The last source of the false accept rate in SHD [65] is the inability of SHD to backtrack (after generating the final bit-vector) the location of each long identical subsequence (i.e., the mask that originates the identical subsequence), which is part of the correct alignment. The source of each subsequence provides a key insight into the actual number of edits between every two subsequences. That is, if a subsequence is located in the 2-step right-shifted mask, this indicates that there are two deletions before this subsequence.
Figure 3.5: Examples of incorrect mappings that pass the SHD filter due to (a) overlapping identical subsequences, and (b) lack of backtracking.

SHD does not relate this important fact to the number of edits in the final bit-vector. The lack of backtracking causes two types of falsely accepted mappings: (1) the first type appears clearly when two of the identical subsequences in the individual Hamming masks overlap or nearly overlap; (2) the second type happens when the identical subsequences come from different Hamming masks. The issue with the first type (i.e., overlapping subsequences) is that they appear as a single identical subsequence in the final bit-vector, due to the use of the AND operation. An example of this scenario is given in Figure 3.5(a). This tends to hide some of the edits and eventually causes some invalid mappings to pass. The second type of false positives caused by the lack of backtracking happens, for example, when one identical subsequence comes from the first Hamming mask (i.e., with no shift) and the next identical subsequence comes from the 3-step left-shifted mask. This scenario reveals that the number of edits between the two subsequences should be no less than three insertions. However, SHD inaccurately reports it as a single edit (due to ANDing all Hamming masks without backtracking the source of each streak of zeros), as illustrated in Figure 3.5(b). Keeping track of the source mask of each identical subsequence prevents such false positives and helps to reveal the correct number of edits.
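The structure of these identical subsequences can be illustrated with a small sketch. For a substitution-only pair with E edits, the matching positions split into at most E+1 runs whose lengths sum to m − E (a hypothetical helper of ours, covering substitutions only; handling indels would require walking the shifted masks):

```python
def identical_subsequences(read: str, ref: str):
    """Illustrative check of the long-match property for a substitution-only
    pair: E edits split the read into at most E+1 identical subsequences
    whose lengths sum to m - E, where m is the read length."""
    runs, cur = [], 0
    for a, b in zip(read, ref):
        if a == b:
            cur += 1          # extend the current run of matches
        else:
            if cur:
                runs.append(cur)
            cur = 0           # an edit terminates the run
    if cur:
        runs.append(cur)
    return runs

# Example: two substitutions (E = 2) in a 10 bp read yield
# at most E + 1 = 3 runs with total length m - E = 8.
runs = identical_subsequences("ACGTACGTAC", "ACGAACGTTC")
```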
In this section, we provide our own observations and recommendations based on our comprehensive accuracy analysis of the SHD filter [65]. We make two crucial observations. (1) The first observation is that handling the short streaks of '0's (i.e., using the SRS method that we discuss above) is inherently inefficient. These "noisy" streaks do not have well-determined properties, as their length and number are unpredictable (random-like). They introduce their own sources of falsely accepted mappings and do not contribute any useful information. Therefore, future filtering strategies should avoid processing such short streaks of '0's. (2) The second observation is that the correct (desired) alignment always contains all the longest non-overlapping identical subsequences. This turns our attention to the long matches (highlighted in green in all previous figures, i.e., Figure 3.1 to Figure 3.5) in each Hamming mask. We find that the long non-overlapping subsequences of consecutive zeros have two interesting properties. (1) There is an upper bound on their quantity: with E edits, there are at most E+1 non-overlapping identical subsequences shared between a pair of sequences, and the total length of these non-overlapping subsequences is equal to m-E, where m is the read length. (2) The source mask of each long subsequence provides an insight into the number of edits between this subsequence and its preceding one. These two observations motivate us to incorporate long-match awareness into the design of our filtering strategy and to avoid processing noisy short matches.

3.4 Summary

We identify four causes of the filtering inaccuracy of the SHD [65] algorithm, namely: random zeros, conservative counting, leading and trailing zeros, and lack of backtracking.
Based on these four sources of falsely accepted mappings, we observe that there are still opportunities for further improvements in the accuracy of the state-of-the-art filter, SHD, which we discuss next.

Chapter 4
The First Hardware Accelerator for Pre-Alignment Filtering
In this chapter, we introduce a new FPGA-based accelerator architecture for hardware-aware pre-alignment filtering algorithms. To our knowledge, this is the first work that exploits reconfigurable hardware platforms to accelerate pre-alignment filtering. A fast filter designed on a specialized hardware platform can drastically expedite alignment by reducing the number of locations that must be verified via dynamic programming. This eliminates many unnecessary expensive computations, thereby greatly improving overall run time.
We select the FPGA as an acceleration platform for our proposed pre-alignment filtering algorithms, as its architecture offers large amounts of parallelism [114, 123, 124]. The use of an FPGA as an acceleration platform can yield significant performance improvements, especially for massively parallel algorithms.

FPGAs are the most commonly used form of reconfigurable hardware engines in bioinformatics today [125, 126, 115], and their computational capabilities are greatly increasing every generation due to the increasing number of transistors on the FPGA chip. An FPGA chip can be programmed (i.e., configured) to include a very large number of hardware execution units that are custom-tailored to the problem at hand.
One of our aims is to accelerate our new pre-alignment filtering algorithms (which we describe in the next three chapters) by leveraging the capabilities and parallelism of FPGAs. To this end, we build our own hardware accelerator that consists of an FPGA engine as an essential component and a CPU. We present in Figure 4.1 the overall architecture of our FPGA-based accelerator. The CPU is responsible for acquiring and encoding the short reads and transferring the data to and from the FPGA. The FPGA engine is equipped with PCIe transceivers, a Read Controller, a Result Controller, and a set of filtering units that are responsible for examining the read alignments. The workflow of the accelerator starts with reading a repository of short reads and seed locations. All reads are then converted into a binary representation that can be understood by the FPGA engine. Encoding the reads is a preprocessing step and is accomplished through a Read Encoder at the host before transmitting the reads to the FPGA chip. Next, the encoded reads are transmitted and processed in a streaming fashion through the fastest communication medium available on the FPGA board (i.e., PCIe). We design our system to perform alignment filtering in a streaming fashion: the accelerator receives a continual stream of short reads, examines each alignment in parallel with the others, and returns the decision (i.e., a single bit of value '1' for an accepted sequence and '0' for a rejected sequence) back to the CPU immediately upon processing.
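The text does not fix a particular binary encoding here; as an illustrative sketch, a host-side Read Encoder could pack each base into 2 bits, so that a 100 bp read fits in 200 bits before the PCIe transfer (the mapping A/C/G/T → 00/01/10/11 and the function names are our assumptions):

```python
ENCODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}  # assumed 2-bit code

def encode_read(read: str) -> int:
    """Hypothetical sketch of the host-side Read Encoder: pack each base
    into 2 bits, first base in the most significant position."""
    word = 0
    for base in read:
        word = (word << 2) | ENCODE[base]
    return word

def decode_read(word: int, length: int) -> str:
    """Inverse of encode_read, for checking the round trip."""
    bases = 'ACGT'
    return ''.join(bases[(word >> (2 * (length - 1 - i))) & 0b11]
                   for i in range(length))
```

A fixed-width packed representation like this is what lets the FPGA engine consume reads as a raw bit stream over PCIe without any host-side parsing on the device.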
Figure 4.1: Overview of our accelerator architecture.
The Read Controller on the FPGA side is responsible for two main tasks. First, it permanently assigns the first data chunk as a reference sequence for all processing cores. Second, it manages the subsequent data chunks and distributes them to the processing cores: the first processing core receives the first read sequence, the second core receives the second sequence, and so on, up to the last core. The Read Controller repeats this data chunk distribution until no more reads are left in the repository.
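The distribution policy above can be modeled in a few lines. This is a minimal Python sketch of the ordering only; the unit count and chunk names are hypothetical, and the FIFO-based hardware implementation is not modeled:

```python
from itertools import cycle

def distribute(chunks, num_units):
    """Model of the Read Controller's distribution policy: the first
    chunk is pinned as the reference for every filtering unit; the
    remaining chunks are handed out round-robin, preserving order."""
    reference, reads = chunks[0], chunks[1:]
    units = {u: {"reference": reference, "reads": []}
             for u in range(num_units)}
    for unit, read in zip(cycle(range(num_units)), reads):
        units[unit]["reads"].append(read)
    return units

# Hypothetical chunk names, two filtering units.
units = distribute(["REF", "r1", "r2", "r3", "r4", "r5"], num_units=2)
```

Because the assignment is deterministic, the Result Controller can gather the per-unit decisions back in the original repository order.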
Following similar principles as the Read Controller, the Result Controller gathers the output results of the filtering units. Both the Read Controller and the Result Controller preserve the original order of reads as in the repository (i.e., at the host). This is critical to ensure that each read receives its own alignment filtering result. The results are transmitted back to the CPU side in a streaming fashion and then saved in the repository.

Figure 4.2: Xilinx Virtex-7 FPGA VC709 Connectivity Kit and chip layout of our implemented accelerator.
We design our hardware accelerator to exploit the large amounts of parallelism offered by FPGA architectures [114, 123, 124, 127, 128, 129, 130, 131, 132]. We take advantage of the fact that alignment filtering of one read is inherently independent of the filtering of another read. We can therefore examine many reads in a parallel fashion. In particular, instead of handling each read sequentially, as CPU-based filters (e.g., SHD) do, we can process a large number of reads at the same time by integrating as many hardware filtering units as possible (constrained by chip area) in the FPGA chip. Each filtering unit is a complete alignment filter and can handle a single read at a time. Our hardware accelerator contains a large number of filtering units, whose count can be configured by the user. Each filtering unit performs pre-alignment filtering independently of all other units. We use the term "filtering unit" in this work to refer to the entire operation of the filtering process involved. Filtering units are part of our architecture and are unrelated to the terms "CPU core" or "thread".

4.4 Hardware Implementation
Our hardware implementation of our accelerator is independent of any specific FPGA platform, as it does not rely on any vendor-specific computing elements (e.g., intellectual property cores). However, each FPGA board has different features and hardware capabilities that can directly or indirectly affect the performance and the data throughput of the design. In fact, the number of filtering units is determined by the maximum data throughput and the available FPGA resources. We use a Xilinx Virtex-7 VC709 board [133] to implement our accelerator architecture. We build the FPGA design with Vivado 2015.4 in synthesizable Verilog. We present the chip layout of our hardware accelerator in Figure 4.2. The maximum operating frequency of our accelerator and the VC709 board is 250 MHz. At this frequency, we observe a data throughput of nearly 3.3 GB/s, which corresponds to 13.3 billion bases per second. This nearly reaches the peak throughput of 3.64 GB/s provided by the RIFFA [134] communication channel that feeds data into the FPGA using a Gen3 4-lane PCIe link.
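The relationship between link throughput and base throughput follows directly from the 2-bit-per-base encoding described in Chapter 5. A quick sanity check, using the rounded numbers quoted above:

```python
# Each DNA base is encoded in 2 bits, so a link moving B bytes/s
# carries (B * 8) / 2 bases per second.
def bases_per_second(bytes_per_second):
    bits_per_base = 2
    return bytes_per_second * 8 / bits_per_base

# ~3.3 GB/s over PCIe corresponds to ~13.2 billion bases/s, in line
# with the ~13.3 Gbases/s figure above (the GB/s value is rounded).
rate = bases_per_second(3.3e9)
```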
We introduce in this chapter a new hardware accelerator architecture that exploits the large amounts of parallelism offered by FPGA architectures to boost the performance of our pre-alignment filters. Our hardware accelerator performs pre-alignment filtering for each sequence pair independently of all other pairs. We can therefore examine many reads in a parallel fashion. We build the hardware architecture of our accelerator using many hardware filtering units, where each filtering unit is a complete pre-alignment filter and can handle a single read at a time. To take full advantage of the capabilities and parallelism of our FPGA accelerator, each pre-alignment filtering unit needs to be designed and implemented using FPGA-supported operations such as bitwise operations, bit shifts, and bit count. Next, we discuss our proposed pre-alignment filters that can be included in our FPGA accelerator as filtering units.

Chapter 5
GateKeeper: Fast Hardware Pre-Alignment Filter
In this chapter, we introduce a new FPGA-based fast alignment filtering technique (called GateKeeper) that acts as a pre-alignment step in read mapping. Our filtering technique improves and accelerates the state-of-the-art SHD filtering algorithm [65] using new mechanisms and FPGAs.
Our new filtering algorithm has two properties that make it suitable for an FPGA-based implementation: (1) it is highly parallel, and (2) it heavily relies on bitwise operations such as shift, XOR, and AND. Our architecture discards the incorrect mappings from the candidate mapping pool in a streaming fashion: data is processed as it is transferred from the host system. Filtering the mappings in a streaming fashion gives us the ability to integrate our filter with any mapper that performs alignment, such as Bowtie 2 [22] and BWA-MEM [21]. Our current filter implementation relies on several optimization methods to create a robust and efficient filtering approach. At both the design and implementation stages, we satisfy several requirements: (1) ensuring a lossless filtering algorithm by preserving all correct mappings; (2) supporting both Hamming distance and edit distance; and (3) examining the alignment between a read and a reference segment in a fast and efficient way (in terms of execution time and required resources).
Our primary purpose is to enhance the state-of-the-art SHD alignment filter such that we can greatly accelerate pre-alignment by taking advantage of the capabilities and parallelism of FPGAs. To achieve our goal, we design an algorithm inspired by SHD to reduce both the utilized resources and the execution time. These optimizations enable us to integrate more filtering units within the FPGA chip and hence examine many mappings at the same time. We present three new methods that we use in each GateKeeper filtering unit to improve execution time. Our first method introduces a new algorithmic approach for performing alignment filtering very rapidly compared to the original SHD. This method provides: (1) fast detection of exactly matching alignments and (2) handling of one or more base substitutions. Our second method supports calculating the edit distance with a new, very efficient hardware design. Our third method addresses the problem of hardware resource overheads introduced by the use of the FPGA as an acceleration platform. All methods are implemented within the hardware filtering unit of our accelerator (see Chapter 4) and are thus performed highly efficiently. We present a flowchart representation of all steps involved in our algorithm in Figure 5.1. Next, we describe the three new methods.
We first discuss how to examine a mapping with a given Hamming distance threshold, and later extend our solution to support edit distance. Our first method aims to quickly detect the obviously-correct alignments that contain no edits or only a few substitutions (i.e., fewer than the user-defined threshold). If the first method detects a correct alignment, then we can skip the other two methods, but we still need the optimal alignment algorithms. A read is mappable if the Hamming distance between the read and its seed location does not exceed the given Hamming distance threshold. Hence, the first step is to identify all bp matches by calculating what we call a Hamming mask. The Hamming mask is a bit-vector of '0's and '1's representing the comparison of the read and the reference, where a '0' represents a bp match and a '1' represents a bp mismatch. We need to count only the occurrences of '1' in the Hamming mask and examine whether their total number is less than or equal to the user-defined Hamming distance threshold. If so, the mapping is considered valid and the read passes the filter. If, instead, the total number of '1's is greater than the Hamming distance threshold, then we cannot be certain whether this is because of a high number of substitutions, or because there exist insertions and/or deletions; hence, we need to follow the rest of our algorithm. Our filter can detect not only substitutions but also insertions and deletions in an efficient way, as we discuss next.
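This first check can be sketched in a few lines. The following is a Python model for clarity only; the hardware operates on the 2-bit encodings described later in this chapter, not on character strings:

```python
def hamming_mask(read, ref):
    """Per-base comparison: '0' for a match, '1' for a mismatch."""
    return "".join("0" if r == t else "1" for r, t in zip(read, ref))

def passes_substitution_check(read, ref, threshold):
    """Accept if the number of mismatches is within the threshold."""
    return hamming_mask(read, ref).count("1") <= threshold

mask = hamming_mask("ACGTACGT", "ACGAACGT")                     # "00010000"
ok = passes_substitution_check("ACGTACGT", "ACGAACGT", threshold=2)
```

A count above the threshold does not by itself reject the mapping; as the text explains, it only means the indel-handling steps below are still needed.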
Our indel detection algorithm is inspired by the original SHD algorithm presented in [65]. If the substitution detection rejects an alignment, then GateKeeper checks whether an insertion or deletion causes the violation (i.e., the high number of edits). Figure 5.2 illustrates the effect of the occurrence of edits on the alignment process. If there are one or more base substitutions, or the alignment is an exact match, the matching and mismatching regions can be accurately determined using Hamming distance. The Hamming mask also helps detect the matches that are located before the first indel, and this mask is already generated as part of the first method of the algorithm (i.e., Fast Approximate String Matching). On the other hand, each insertion and deletion can shift multiple trailing bases and create multiple edits in the Hamming mask. Thus, our indel detection method identifies whether the alignment locations of a read are valid by shifting individual bases.
Figure 5.1: A flowchart representation of the GateKeeper algorithm.
Figure 5.2: An example showing how various types of edits affect the alignment of two reads. In (a), the upper read exactly matches the lower read, and thus each base exactly matches the corresponding base in the target read. (b) shows base substitutions, which only affect the alignment at their own positions. (c) and (d) demonstrate insertions and deletions, respectively. Each such edit influences the alignment of all the subsequent bases.

We need to perform E incremental shifts in the right (or left) direction to detect any read that has E deletions (or insertions), where E is the edit distance threshold. The shift process is guaranteed to cancel the effect of an indel. As we do not have prior knowledge about whether there exist substitutions, indels, or a combination of both, we need to test for every possible case in our algorithm. Thus, GateKeeper generates 2E + 1 Hamming masks regardless of the source of the edits. The last step is to merge all the 2E + 1 Hamming masks using a bitwise AND operation. This step tells us where the relevant matching and mismatching regions reside in the presence of edits in the read compared to the reference segment. Similarly to SHD, we apply an amending process to the 2E + 1 masks before performing the AND operation. In SHD, the amending process is accomplished using a 4-bit packed shuffle (a SIMD parallel table-lookup instruction), shift, and OR operations. The number of computations needed is 4 packed shuffles, 4m bitwise ORs, and three shift operations for each Hamming mask, which is (7 + 4m)(2E + 1) operations in total, where m is the read length. We find that this is very inefficient for an FPGA implementation.
Figure 5.3: Workflow of the proposed architecture for the parallel amendment operations.

To reduce the number of operations, we build a new hardware-based amending process. We propose using dedicated hardware components in FPGA slices. More precisely, rather than shifting the read and then performing a packed shuffle to replace patterns of 101 or 1001 with 111 or 1111, respectively, we perform only the packed shuffle, independently and concurrently for each bit of each Hamming mask. We present the proposed architecture for the amendment operation in Figure 5.3. In order to replace all patterns of 101 or 1001 with 111 or 1111, respectively, we use a single 5-input look-up table (LUT) for each bit of the Hamming mask. The first LUT copies the bit value of the first input regardless of its value; even if it is zero, it will not be amended, as it does not contribute to a 101 or 1001 pattern. Likewise for the last LUT. Thus, the total number of LUTs needed is equal to the length of the short read in bases minus 2, for the first and last bases. In each LUT, we consider a single bit of the Hamming mask together with its two right neighboring bits and its two left neighboring bits. If the input that corresponds to the output has a bit value of one, then the output copies the value of that input bit (as we only amend zeros). Otherwise, using the previous two bits and the following two bits with respect to the input bit, we can replace any zero of the "101" or "1001" patterns independently of the other output bits. All bits of the amended masks are generated at the same time, as the propagation delay through an FPGA look-up table is independent of the implemented function [135]. Thus, we can process all masks in a parallel fashion without affecting the correctness of the filtering decision. Using this dedicated architecture, we are able to get rid of the four shifting operations and perform the amending process concurrently for all bits of any Hamming mask.
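The per-bit 5-input LUT can be modeled bit by bit in software. The window logic below is our reading of the 101/1001 replacement rule and is a sketch, not the actual LUT truth table; out-of-range neighbors are treated as zeros:

```python
def amend(mask):
    """Per-bit amendment: each output bit depends only on a 5-bit
    window of the input (a 5-input LUT in hardware), replacing the
    zeros inside '101' and '1001' patterns with ones."""
    bits = [int(b) for b in mask]
    n = len(bits)

    def get(i):                      # out-of-range bits read as 0
        return bits[i] if 0 <= i < n else 0

    out = []
    for i in range(n):
        if bits[i] == 1:             # ones are copied through
            out.append(1)
        else:
            in_101 = get(i - 1) == 1 and get(i + 1) == 1
            in_1001 = (get(i - 1) == 1 and get(i + 1) == 0 and get(i + 2) == 1) \
                   or (get(i - 2) == 1 and get(i - 1) == 0 and get(i + 1) == 1)
            out.append(1 if (in_101 or in_1001) else 0)
    return "".join(map(str, out))

amended = amend("0101001000")        # the 101 and 1001 runs become 111 and 1111
```

In hardware, every output bit is produced simultaneously by its own LUT, so the whole mask is amended in a single LUT delay.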
Thus, the required number of operations is only (2E + 1) instead of (7 + 4m)(2E + 1) for a total of (2E + 1) Hamming masks. This saves a considerable amount of the filtering time, reducing it by two orders of magnitude for a read that is 100 bp long. The short reads are composed of a string of nucleotides from the DNA alphabet {A, C, G, T}. Since the reads are processed on an FPGA platform, the symbols have to be encoded into a unique binary representation. We need 2 bits to encode each symbol. Hence, encoding a read sequence of length m results in a 2m-bit word. Encoding the reads into a binary representation introduces overhead to accommodate not only the encoded reads but also the Hamming masks, as their lengths also double (i.e., 2m bits). The issue introduced by encoding the reads can be even worse when we apply certain operations to these Hamming masks. For example, the number of LUTs required for performing the amending process on the Hamming masks doubles, mainly due to encoding the read. To reduce the complexity of the subsequent operations on the Hamming masks and save about half of the required amount of FPGA resources, we propose a new solution. We observe that comparing a pair of DNA nucleotides is similar to comparing their binary representations (e.g., comparing A to T is similar to comparing '00' to '11'). Hence, comparing each two bits of the binary representation of the read with their corresponding bits of the reference segment generates a single bit that represents one of two meanings: either a match or a mismatch between two bases.
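This mask-halving idea can be sketched as follows. Note that the exact 2-bit code is an assumption here: the text only fixes A as '00' and T as '11' by example, so the C/G assignments below are hypothetical:

```python
# A hypothetical 2-bit code; the text only fixes A = 00 and T = 11
# as an example, so the C/G assignments here are assumptions.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(seq):
    """Encode an m-base read into a 2m-bit word."""
    return "".join(CODE[base] for base in seq)

def halved_hamming_mask(read, ref):
    """XOR the 2m-bit words, then OR each pair of bits to shrink
    the mask back to m bits: 0 = base match, 1 = base mismatch."""
    a, b = encode(read), encode(ref)
    xor = [x != y for x, y in zip(a, b)]
    return "".join(str(int(xor[i] or xor[i + 1]))
                   for i in range(0, len(xor), 2))

mask = halved_hamming_mask("ACGT", "ACTT")   # third base differs
```

Each output bit depends only on two XOR bits, so in hardware the pairwise ORs all happen in parallel.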
Figure 5.4: An example of applying our solution for reducing the number of bits of each Hamming mask by half. We use a modified Hamming mask to store the result of applying the bitwise OR operation to each two bits of the Hamming mask. The modified mask maintains the same meaning as the original Hamming mask.

This is performed by encoding each two bits of the result of the pairwise comparison (i.e., bitwise XOR) into a single bit of '0' or '1' using OR operations in a parallel fashion, as explained in Figure 5.4. This makes the length of each Hamming mask equal to the length of the original read, without affecting the meaning of each bit of the mask. The modified Hamming masks are then merged together using 2E bitwise AND operations. Finally, we count the number of ones (i.e., edits) in the final bit-vector mask; if the count does not exceed the edit distance threshold, the filter accepts the mapping. In this section, we provide the pseudocode of our GateKeeper algorithm. Algorithm 5.1 presents the main functions. Algorithm 5.2 presents the details of the amending process used in Algorithm 5.1.

Algorithm 5.1:
GateKeeper filtering algorithm
Input: Candidate read bit-vector B = {b1, b2, ..., b2m}, reference bit-vector A = {a1, a2, ..., a2m}, edit distance threshold E.
Output: Pass (return True if the read passes the GateKeeper filter).
Functions: Amend: encodes and amends a Hamming mask (Algorithm 5.2).
Pseudocode:
1: // Calculate the Hamming distance first.
2: HammingMask[2E+1] ← A ⊕ B;
3: AmendedMask[2E+1] ← Amend(HammingMask[2E+1]);
4: // Generate 2E masks with incremental shifts.
5: for i ← 1 to E do
6:   HammingMask[i] ← (B >> i) ⊕ A;
7:   AmendedMask[i] ← Amend(HammingMask[i]);
8:   HammingMask[i+E] ← (B << i) ⊕ A;
9:   AmendedMask[i+E] ← Amend(HammingMask[i+E]);
10: end for
11: // Merge all amended masks and count the edits.
12: FinalMask ← AmendedMask[1] AND AmendedMask[2] AND ... AND AmendedMask[2E+1];
13: e ← CountOnes(FinalMask);
14: if e ≤ E then return True;
15: else return False;
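The shift-and-AND structure of Algorithm 5.1 can be transliterated into software as follows. This is a simplified Python sketch that operates on character strings rather than 2-bit-encoded words; the amendment step is omitted, which only makes the model more permissive (it remains lossless):

```python
def gatekeeper(read, ref, e):
    """Simplified model of GateKeeper's shift-and-AND filtering:
    build 2E + 1 Hamming masks (one unshifted plus E right-shifted
    and E left-shifted alignments of the read against the
    reference), AND them together, and accept if the surviving
    mismatch count is within the edit distance threshold E."""
    m = len(read)

    def mask(shift):
        out = []
        for j in range(m):
            i = j - shift
            if 0 <= i < m:
                out.append(0 if read[i] == ref[j] else 1)
            else:
                out.append(0)   # out-of-range bits count as matches
        return out

    masks = [mask(s) for s in range(-e, e + 1)]   # 2E + 1 masks
    final = [all(col) for col in zip(*masks)]     # bitwise AND
    return sum(final) <= e
```

A one-base shift between the two sequences (an indel-like case) is cancelled by one of the shifted masks, so the pair passes with E = 1 even though its unshifted Hamming distance is large.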
Algorithm 5.2: Amend
Input: Hamming mask bit-vector H = {h1, h2, ..., h2m}.
Output: Modified Hamming mask C = {c1, c2, ..., cm}.
Pseudocode:
1: i ← 1;
2: while i < 2m do
3:   C[(i+1)/2] ← H[i] OR H[i+1]; // encode each bit pair into one bit (Figure 5.4)
4:   i ← i + 2;
5: end while
6: for each bit C[k], 2 ≤ k ≤ m−1, in parallel do
7:   amend C[k] using its two left and two right neighbors, replacing the zeros of the '101' and '1001' patterns with ones (Figure 5.3);
8: end for
9: return C;

We introduce the first hardware acceleration system for alignment filtering, called GateKeeper. We develop both a hardware-acceleration-friendly filtering algorithm and a highly-parallel hardware accelerator design. GateKeeper is a standalone filter and can be integrated with any existing reference-based mapper. GateKeeper does not replace read alignment. GateKeeper should be followed by a read alignment step, which precisely verifies the mappings that pass our filter and eliminates the falsely accepted ones.

Chapter 6
Shouji: Fast and Accurate Hardware Pre-Alignment Filter

Our primary purpose is to reject incorrect mappings accurately and quickly such that we reduce the need for the computationally expensive alignment step. In this chapter, we propose the Shouji algorithm to achieve highly accurate filtering. We then accelerate Shouji by taking advantage of the capabilities and parallelism of FPGAs to achieve fast filtering operations. Shouji's filtering strategy is inspired by our analytical study of SHD's filtering accuracy (see Chapter 3). We discuss the details of the Shouji algorithm next.

The key filtering strategy of Shouji is inspired by the pigeonhole principle, which states that if E items are distributed into E+1 boxes, then at least one box remains empty. In the context of pre-alignment filtering, this principle provides the following key observation: if two sequences differ by E edits, then the two sequences share at least a single common subsequence (i.e., one that is free of edits) and at most E+1 non-overlapping common subsequences, where E is the edit distance threshold.

Figure 6.1: Random edit distribution in a read sequence. The edits (e1, e2, ..., eE) act as dividers, resulting in several identical subsequences (m1, m2, ..., mE+1) between the read and the reference.
With the existence of at most E edits, the total length of these non-overlapping common subsequences should not be less than m − E, where m is the sequence length, as illustrated in Figure 6.1. Shouji employs the pigeonhole principle to decide whether or not two sequences are potentially similar. Shouji finds all the non-overlapping subsequences that exist in both sequences. If the total length of these common subsequences is less than m − E, then there exist more edits than the allowed edit distance threshold, and hence Shouji rejects the two given sequences. Otherwise, Shouji accepts the two sequences. Next, we discuss the details of Shouji.

Shouji identifies the dissimilar sequences, without calculating the optimal alignment, in three main steps. (1) The first step is to construct what we call a neighborhood map that visualizes the pairwise matches and mismatches between two sequences given an edit distance threshold of E characters. (2) The second step is to find all the non-overlapping common subsequences in the neighborhood map using a sliding search window approach. (3) The last step is to accept or reject the given sequence pair based on the total length of the found matches. If the total length of the found matches is small, then Shouji rejects the input sequence pair.

6.2.1 Method 1: Building the Neighborhood Map

The neighborhood map, N, is a binary m-by-m matrix, where m is the read length. Given a text sequence T[1 ... m], a pattern sequence P[1 ... m], and an edit distance threshold E, the neighborhood map represents the comparison result of the i-th character of P with the j-th character of T, where i and j satisfy 1 ≤ i ≤ m and i − E ≤ j ≤ i + E. The entry N[i, j] of the neighborhood map is calculated as follows:

N[i, j] = 0 if P[i] = T[j], and N[i, j] = 1 if P[i] ≠ T[j].   (6.1)

We present in Figure 6.2 an example of a neighborhood map for two sequences, where sequence B differs from sequence A by three edits.
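Equation 6.1 and the banded structure of the map can be sketched as follows. This Python model leaves entries outside the band as None purely for illustration; the hardware never computes them at all:

```python
def neighborhood_map(text, pattern, e):
    """Banded binary m-by-m map: N[i][j] = 0 on a character match,
    1 on a mismatch, computed only for i - E <= j <= i + E
    (Equation 6.1, with 0-based indices)."""
    m = len(text)
    n = [[None] * m for _ in range(m)]
    for i in range(m):
        for j in range(max(0, i - e), min(m, i + e + 1)):
            n[i][j] = 0 if pattern[i] == text[j] else 1
    return n

# Figure 6.2's sequences (E = 3): the main diagonal starts with
# four matches, corresponding to the common subsequence GGTG.
N = neighborhood_map("GGTGCAGAGCTC", "GGTGAGAGTTGT", 3)
```

Because every entry is a single independent character comparison, all 2E + 1 diagonals can be produced in one step in hardware.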
The entry N[i, j] is set to zero if the i-th character of the pattern matches the j-th character of the text. Otherwise, it is set to one. The way we build our neighborhood map ensures that computing each of its entries is independent of every other, and thus the entire map can be computed all at once in a parallel fashion. Hence, our neighborhood map is well suited for highly-parallel computing platforms [71, 136]. Note that in sequence alignment algorithms, computing each entry of the dynamic programming matrix depends on the values of its immediate left, upper left, and upper entries. Different from the "dot plot" or "dot matrix" (a visual representation of the similarities between two closely similar genomic sequences) that is used in FASTA/FASTP [137], our neighborhood map computes only the necessary diagonals near the main diagonal of the matrix (e.g., for E = 3, Shouji computes only the 2E + 1 diagonal vectors of the neighborhood map).

The key goal of this step is to accurately find all the non-overlapping common subsequences shared between a pair of sequences. The accuracy of finding these subsequences is crucial for the overall filtering accuracy, as the filtering decision is made solely based on the total subsequence length.

Figure 6.2: Neighborhood map (N) and the Shouji bit-vector, for text T = GGTGCAGAGCTC and pattern P = GGTGAGAGTTGT, for E = 3. The three common subsequences (i.e., GGTG, AGAG, and T) are highlighted in yellow. We use a search window of size 4 columns (two examples of which are highlighted in red) with a step size of a single column. Shouji searches diagonally within each search window for the 4-bit vector that has the largest number of zeros. Once found, Shouji examines whether the found 4-bit vector maximizes the number of zeros at the corresponding location in the Shouji bit-vector. If so, Shouji stores this 4-bit vector in the Shouji bit-vector at its corresponding location.

With the existence of E edits, there are at most E+1 non-overlapping common subsequences (by the pigeonhole principle) shared between a pair of sequences. Each non-overlapping common subsequence is represented as a streak of diagonally-consecutive zeros in the neighborhood map (as highlighted in yellow in Figure 6.2). These streaks of diagonally-consecutive zeros are distributed along the diagonals of the neighborhood map without any prior information about their length or number. One way of finding these common subsequences is to use a brute-force approach, which examines all the streaks of diagonally-consecutive zeros that start at the first column and selects the streak that has the largest number of zeros as the first common subsequence. It then iterates over the remaining part of the neighborhood map to find the other common subsequences. However, this brute-force approach is infeasible for a highly-optimized hardware implementation, as the search space is unknown at design time. Shouji overcomes this issue by dividing the neighborhood map into equal-size parts. We call each part a search window. Limiting the size of the search space from the entire neighborhood map to a search window has three key benefits. (1) It helps to provide a scalable architecture that can be implemented for any sequence length and edit distance threshold. (2) Downsizing the search space to a reasonably small sub-matrix with a known dimension at design time limits the number of all possible permutations of each bit-vector to 2^n, where n is the search window size.
This reduces the size of the look-up tables (LUTs) required for an FPGA implementation and simplifies the overall design. (3) Each search window is considered a smaller sub-problem that can be solved independently and rapidly with high parallelism. Shouji uses a search window that is 4 columns wide, as we illustrate in Figure 6.2. We need m search windows for processing two sequences, each of which is of length m characters. Each search window overlaps with its next neighboring search window by 3 columns. This ensures covering the entire neighborhood map and finding all the common subsequences regardless of their starting locations. We select the width of each search window to be 4 columns to guarantee finding the shortest possible common subsequence, which is a single match located between two mismatches (i.e., '101'). However, we observe that the bit pattern '101' is not always necessarily a part of the correct alignment (or the common subsequences). For example, the bit pattern '101' exists once as a part of the correct alignment in Figure 6.2, but it also appears five times in other locations that are not included in the correct alignment. To improve the accuracy of finding the diagonally-consecutive matches, we increase the length of the diagonal vector to be examined to four bits. We also experimentally evaluate different window sizes in Figure 6.3. We find that a window size of 4 columns provides the highest filtering accuracy without falsely rejecting similar sequences. This is because individual matches (i.e., single zeros) are usually useless and not necessarily part of the common subsequences. As we increase the search window size, we ignore these individual matches and instead look only for longer streaks of consecutive zeros. Shouji finds the diagonally-consecutive matches that are part of the common subsequences in the neighborhood map in two main steps.
Step 1: For each search window, Shouji finds the 4-bit diagonal vector that has the largest number of zeros. Shouji greedily considers this vector as a part of the common subsequence, as it has the least possible number of edits (i.e., 1's). Always finding the maximum number of matches is necessary to avoid overestimating the actual number of edits and eventually to preserve all similar sequences. Shouji achieves this step by comparing the 4 bits of each of the 2E+1 diagonal vectors within a search window and selecting the 4-bit vector that has the largest number of zeros. In the case where two 4-bit subsequences have the same number of zeros, Shouji breaks the tie by selecting the first one that has a leading zero. Then, Shouji slides the search window by a single column (i.e., step size = 1 column) towards the last bottom-right entry of the neighborhood map and repeats the previous computations. Thus, Shouji performs "Step 1" m times using m search windows, where m is the sequence length.

Step 2: The last step is to gather the results found for each search window (i.e., the 4-bit vector that has the largest number of zeros) and reconstruct all the diagonally-consecutive matches. For this purpose, Shouji maintains a Shouji bit-vector of length m that stores all the zeros found in the neighborhood map, as we illustrate in Figure 6.2. For each sliding search window, Shouji examines whether the selected 4-bit vector maximizes the number of zeros in the Shouji bit-vector at the same corresponding location. If so, Shouji stores the selected 4-bit vector in the Shouji bit-vector at the same corresponding location. This is necessary to avoid overestimating the number of edits between two given sequences. The common subsequences are represented as streaks of consecutive zeros in the Shouji bit-vector.

Figure 6.3: The effect of the window size on the rate of falsely accepted sequences (i.e., dissimilar sequences that are considered similar by the Shouji filter). We observe that a window width of 4 columns provides the highest accuracy. We also observe that as the window size increases beyond 4 columns, more similar sequences are falsely rejected by Shouji, which should be avoided.

The last step of Shouji is to calculate the total number of edits (i.e., ones) in the Shouji bit-vector. Shouji examines whether the total number of ones in the Shouji bit-vector is greater than E. If so, Shouji excludes the two sequences from the optimal alignment calculation. Otherwise, Shouji considers the two sequences similar within the allowed edit distance threshold and allows their optimal alignment to be computed using optimal alignment algorithms. The Shouji bit-vector represents the differences between two sequences along the entire length of the sequence, m. However, Shouji is not limited to end-to-end edit distance calculation.
Figure 6.4: An example of applying the Shouji filtering algorithm to a sequence pair, where the edit distance threshold is set to 4. We present the content of the neighborhood map along with the Shouji bit-vector. We apply the Shouji algorithm starting from the leftmost column towards the rightmost column.

Shouji is also able to provide edit distance calculation in a local and glocal (semi-global) fashion.
For example, achieving local edit distance calculation requires ignoring the ones that are located at the two ends of the Shouji bit-vector. Achieving glocal edit distance calculation requires excluding the ones that are located at one of the two ends of the Shouji bit-vector from the total count of ones. This is important for correct pre-alignment filtering for global, local, and glocal alignment algorithms. We present an example of applying the Shouji filtering algorithm in Figure 6.4.

The Shouji filter does not filter out similar sequences; hence, it provides a zero false reject rate. The reason behind this is the way we find the identical subsequences: we always look for the subsequences that have the largest number of zeros, such that we maximize the number of matches and minimize the number of edits that cause the division of one long identical sequence into shorter subsequences. This also allows a very small portion of dissimilar sequences to pass. Next, we analyze the computational complexity (i.e., asymptotic run time and space complexity) of our Shouji filter.

The Shouji filter divides the problem of finding the identical subsequences into at most m subproblems, as described in Algorithm 6.1. Each subproblem examines each of the 2E+1 bit-vectors and finds the 4-bit subsequence that has the largest number of zeros within the sliding window. Once found, the Shouji filter also compares the found subsequence with its corresponding subsequence of the Shouji bit-vector. Now, let c be a constant representing the run time of examining each 4 bits of each bit-vector. Then the time complexity of the Shouji algorithm is as follows:

T_Shouji(m) = c · m · (2E + 1)    (6.2)

This demonstrates that the Shouji algorithm runs in linear time with respect to the sequence length and edit distance threshold. The Shouji algorithm maintains 2E+1 diagonal bit-vectors and an additional auxiliary bit-vector (i.e., the Shouji bit-vector) for each two given sequences.
The space complexity of the Shouji algorithm is as follows:

D_Shouji(m) = m · (2E + 2)    (6.3)

Hence, the Shouji algorithm requires linear space with respect to the read length and edit distance threshold. Next, we outline the hardware implementation details of the Shouji filter.

Algorithm 6.1: Shouji
Input: Seq1, Seq2, edit distance threshold (E).
Output: 1 (similar) or 0 (dissimilar).
Pseudocode:
1:  m ← length(Seq1);
2:  // Build Neighborhood map (N)
3:  for i ← 1 to m do
4:      for j ← i-E to i+E do
5:          if Seq1[i] = Seq2[j] then N[i,j] ← 0;
6:          else N[i,j] ← 1;
7:  // Sliding window search; function CZ() returns the number of zeros
8:  for i ← 1 to m do
9:      Shouji[i] ← 1;
10: for i ← 1 to m-3 do
11:     for j ← 1 to E do
12:         // Compare the upper diagonal with the lower diagonal
13:         if CZ(N[i+j:i+3+j, i:i+3]) > CZ(N[i:i+3, i+j:i+3+j]) then
14:             Z ← N[i+j:i+3+j, i:i+3];
15:         else if CZ(N[i+j:i+3+j, i:i+3]) = CZ(N[i:i+3, i+j:i+3+j]) then
16:             if N[i+j,i] = 0 then Z ← N[i+j:i+3+j, i:i+3];
17:             else if N[i,i+j] = 0 then Z ← N[i:i+3, i+j:i+3+j];
18:         else Z ← N[i:i+3, i+j:i+3+j];
19:         // Compare Z with the main diagonal and the Shouji bit-vector
20:         if CZ(N[i:i+3, i:i+3]) > CZ(Z) then Z ← N[i:i+3, i:i+3];
21:         if CZ(Z) > CZ(Shouji[i:i+3]) then Shouji[i:i+3] ← Z;
22: if CZ(Shouji) ≥ m-E then return 1;
23: else return 0;

To make the best use of the available resources in the FPGA chip, our algorithm utilizes the operations that are easily supported on an FPGA, such as bitwise operations, bit shifts, and bit count. To build the neighborhood map on the FPGA, we use the observation that the main diagonal can be implemented using a bitwise XOR operation between the two given sequences. The vacant bits due to the shift operation are filled with ones.

[Figure content: Step 1 builds the neighborhood map (the diagonal bit-vectors) from the pattern, the text, and the edit distance threshold; Step 2 feeds m search windows, each selecting the 4-bit vector with the highest number of zeros; Step 3 checks whether the zero count is ≥ m-E (1: similar, 0: dissimilar).]

Figure 6.5: A block diagram of the sliding window scheme implemented in FPGA for a single filtering unit.

The upper E diagonals can be implemented by gradually shifting the pattern (P) in the right-hand direction and then performing a bitwise XOR with the text (T). This allows each character of P to be compared with the right-hand neighbor characters (up to E characters) of its corresponding character of T. The lower E diagonals can be implemented in a way similar to the upper E diagonals, but here the shift operation is performed in the left-hand direction. This ensures that each character of P is compared with the left-hand neighbor characters (up to E characters) of its corresponding character of T.

We also build an efficient hardware architecture for each search window of the Shouji algorithm. It quickly finds the number of zeros in each 4-bit vector using a hardware look-up table that stores the 16 possible permutations of a 4-bit vector along with the number of zeros for each permutation. As presented in Figure 6.5, each counter counts the number of zeros in a single bit-vector. The counter takes four bits as input and generates three bits that represent the number of zeros within the window. Each counter requires three 4-input LUTs. In total, we need 6E+6 4-input LUTs to build a single search window. All bits of the counter output are generated at the same time, as the propagation delay through an FPGA look-up table is independent of the implemented function [135]. The comparator is responsible for selecting the 4-bit subsequence that maximizes the number of consecutive matches based on the output of each counter and the Shouji bit-vector. Finally, the selected 4-bit subsequence is stored in the Shouji bit-vector at the same corresponding location.
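To make the sliding-window scheme concrete, here is a minimal software sketch of the Shouji filtering loop. This is an illustration under our own assumptions, not the thesis implementation: the function name is ours, vacant bits are treated as ones, and the hardware tie-breaking between equally good diagonals is simplified to "first encountered wins".

```python
def shouji_filter(ref: str, read: str, e: int) -> bool:
    """Sketch of the Shouji sliding-window filter.

    Builds the 2E+1 diagonal bit-vectors (0 = pairwise match,
    1 = mismatch; vacant bits are set to 1), then slides a 4-column
    window over them, keeping in the Shouji bit-vector the 4-bit
    segment with the most zeros. Accepts iff the assembled bit-vector
    contains at least m - E zeros.
    """
    m = len(read)
    diagonals = [
        [0 if 0 <= i + d < m and read[i] == ref[i + d] else 1
         for i in range(m)]
        for d in range(-e, e + 1)
    ]

    def zeros(v):
        return v.count(0)

    shouji = [1] * m
    for i in range(max(m - 3, 1)):
        # Pick the 4-bit diagonal segment with the most zeros.
        best = max((diag[i:i + 4] for diag in diagonals), key=zeros)
        # Store it only if it improves the corresponding Shouji segment,
        # which avoids overestimating the number of matches.
        if zeros(best) > zeros(shouji[i:i + len(best)]):
            shouji[i:i + len(best)] = best
    return zeros(shouji) >= m - e
```

In this sketch, a pair with a single substitution passes for E ≥ 1, while two unrelated sequences are rejected.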
Our hardware implementation of the Shouji filtering unit is independent of the specific FPGA platform, as it does not rely on any vendor-specific computing elements (e.g., intellectual property cores).

We propose Shouji, a highly parallel and accurate pre-alignment filter designed on a specialized hardware platform. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all identical subsequences shared between two given sequences. This addresses the first two causes of filtering inaccuracy in SHD (i.e., random zeros and conservative counting). The second key idea is to avoid the filtering inaccuracy caused by the leading and trailing zeros (as we discuss in Chapter 3). Shouji replaces the vacant bits that result from shifting the read with ones.

Chapter 7
MAGNET: Accurate Hardware Pre-Alignment Filter

In this chapter, we introduce MAGNET, a pre-alignment filtering algorithm that achieves highly accurate filtering. MAGNET is a filtering heuristic that aims at finding all non-overlapping long streaks of consecutive zeros in the neighborhood map using a divide-and-conquer approach. We discuss the details of the MAGNET algorithm next.

MAGNET uses a divide-and-conquer technique to find all the E+1 identical subsequences, if any, and sums up their lengths. By calculating their total length, we can estimate the total number of edits between the two given sequences. If the total length of the E+1 identical subsequences is less than m-E, then there exist more identical subsequences than E+1, which are associated with more edits than allowed. If so, MAGNET rejects the two given sequences without performing the alignment step. The filtering strategy of MAGNET makes three observations based on the pigeonhole principle to examine each mapping accurately.
1. Given that the user-defined edit distance threshold, E, is usually less than 5% of the read length (m) [18, 65, 122, 62], the identical subsequences are usually long, ranging from a single pairwise match to m pairwise matches.

2. The length of the longest identical subsequence is strictly not less than ⌈(m − E)/(E + 1)⌉ and can be at most m pairwise matches long (i.e., equivalent to the sequence length). The upper bound is trivial and holds when the alignment is free of edits. The lower bound is met with equality when all edits are equispaced and all E+1 subsequences are of the same length. For example, for m = 100 and E = 5, the longest identical subsequence is at least ⌈95/6⌉ = 16 pairwise matches long.

3. We observe that the E+1 identical subsequences that are part of the correct alignment are always non-overlapping. We also observe that the 2E+1 diagonal bit-vectors of the neighborhood map contain other identical subsequences that are always short. This raises a fundamental question about whether the E+1 identical subsequences need to be strictly non-overlapping or only long and not necessarily non-overlapping. Next, we investigate both cases by providing two algorithms for finding the longest identical subsequences.

The MAGNET pre-alignment filter identifies the incorrect mappings, without calculating the optimal alignment, in three main steps. (1) It first constructs the neighborhood map (described in Chapter 6). (2) It then identifies all non-overlapping identical subsequences in the neighborhood map using a divide-and-conquer approach. (3) Finally, it makes a decision (it accepts or rejects the given sequences) based on the length of the found identical subsequences.

7.2.1 Method 1: Building the Neighborhood Map

The first step of the MAGNET algorithm is building our binary m-by-m neighborhood map, which we describe in Chapter 6. Similarly to the Shouji filter, the MAGNET filter starts with building the 2E+1 diagonal bit-vectors of the neighborhood map for the two given sequences.
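The construction of these diagonal bit-vectors can be sketched in Python as follows (an illustrative sketch; the function name and the dictionary layout are our own, and cells falling outside the reference are set to ones):

```python
def neighborhood_map(ref: str, read: str, e: int) -> dict:
    """Build the 2E+1 diagonal bit-vectors of the neighborhood map.

    Diagonal d (-e <= d <= +e) compares read[i] with ref[i + d]:
    0 marks a pairwise match, 1 marks a mismatch. Cells that fall
    outside the reference are treated as mismatches.
    """
    m = len(read)
    return {
        d: [0 if 0 <= i + d < m and read[i] == ref[i + d] else 1
            for i in range(m)]
        for d in range(-e, e + 1)
    }

# Streaks of consecutive zeros on a diagonal are the identical
# subsequences that MAGNET (and Shouji) search for.
diag = neighborhood_map("GGTGGAGAGATC", "GGTGAGAGTTGT", 3)
```

For this pair, the main diagonal (d = 0) starts with four zeros (the shared prefix GGTG), and diagonal d = +1 carries a five-zero streak covering the GAGAG stretch after the deleted base.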
We use the neighborhood map to represent all the pairwise matches between the read and the reference sequences. The neighborhood map also considers the presence of substitutions, insertions, and deletions. The next step is to find the consecutive matches that are part of the correct alignment.

7.2.2 Identifying the E+1 Identical Subsequences

There are two methods to identify the identical subsequences: either find the non-overlapping subsequences of consecutive zeros, or find the top E+1 longest subsequences of consecutive zeros. We describe both methods and show which one is more effective.

E+1 Non-overlapping Subsequences

Finding the E+1 non-overlapping subsequences in the neighborhood map involves three main steps.

1. Extraction. Each diagonal bit-vector nominates its local longest subsequence of consecutive zeros. Among all nominated subsequences, a single subsequence is selected as the global longest subsequence based on its length. Once found, MAGNET evaluates its length: if it is strictly less than ⌈(m − E)/(E + 1)⌉, then the two sequences contain more edits than allowed, which causes the identical subsequences to be shorter (i.e., each edit divides the sequence pair into more identical subsequences). If so, the two sequences are rejected. Otherwise, MAGNET stores its length to be used towards calculating the total length of all E+1 identical subsequences.

2. Encapsulation. The next step is essential to preserve the original edit (or edits) that causes a single identical sequence to be divided into smaller subsequences. MAGNET penalizes the found subsequence by two edits (one for each side). This is achieved by excluding from the search space of all bit-vectors the indices of the found subsequence in addition to the index of the surrounding single bit on both the left and right sides.

3. Divide-and-Conquer Recursion.
In order to locate the other E non-overlapping subsequences, MAGNET applies a divide-and-conquer technique, where we decompose the problem of finding the non-overlapping identical subsequences into two subproblems. While the first subproblem focuses on finding the next long subsequence located on the right-hand side of the subsequence found in the first extraction step, the second subproblem focuses on the other side of the found subsequence. Each subproblem is solved by recursively repeating all three steps mentioned above, but without evaluating the length of the longest subsequence again. MAGNET applies two early termination methods that aim at reducing the execution time of the filter. The first method is evaluating the length of the longest subsequence in the first recursion call. The second method is limiting the number of subsequences to be found to at most E+1, regardless of their actual number for each sequence pair.

E+1 Longest Subsequences

Alternatively, MAGNET can be changed to find only the top E+1 longest subsequences without the restriction that they be non-overlapping. This can be achieved by maintaining a binary max-heap priority queue [138] that stores the length of each of the E+1 subsequences from each mask (i.e., diagonal bit-vector). In total, the queue stores up to (E+1)·(2E+1) elements.

Edit Distance    False Accept Rate              Falsely Accepted Pairs
Threshold (E)    Non-overlapping  Top Longest   Non-overlapping  Any Longest
1                0.00%            0.00%         46               64
2                0.01%            0.09%         268              2674
3                0.04%            0.68%         1110             20504
4                0.08%            4.76%         2461             142906
5                0.18%            18.85%        5457             565463

Figure 7.1: The false accept rate of MAGNET using two different algorithms for identifying the identical subsequences. We observe that finding the E+1 non-overlapping identical subsequences leads to a significant reduction in the incorrectly accepted sequences compared to finding the top E+1 longest identical subsequences.
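The heap-based bookkeeping can be sketched with Python's heapq module (an illustrative sketch under our own naming; heapq implements a min-heap, so run lengths are negated to pop the longest run first):

```python
import heapq

def zero_runs(bits):
    """Lengths of the maximal runs of consecutive zeros in a bit-vector."""
    runs, length = [], 0
    for b in bits:
        if b == 0:
            length += 1
        else:
            if length:
                runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs

def top_longest_total(bitvectors, e):
    """Total length of the top E+1 zero runs over all bit-vectors.

    Each bit-vector contributes its E+1 longest runs, so the heap holds
    up to (E+1)*(2E+1) elements, matching the bound in the text.
    """
    heap = []
    for bits in bitvectors:
        for run in heapq.nlargest(e + 1, zero_runs(bits)):
            heapq.heappush(heap, -run)   # negate: heapq is a min-heap
    total = 0
    for _ in range(min(e + 1, len(heap))):
        total -= heapq.heappop(heap)     # pop the current longest run
    return total
```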
The length of the top longest subsequence is always stored at the root of the heap (the heap property). We need to extract the root of the heap structure E+1 times in order to find the total length of the top E+1 longest subsequences. We evaluate both algorithms for finding the longest identical subsequences in Figure 7.1. We observe that the accuracy of finding the top E+1 longest subsequences degrades exponentially as the edit distance threshold increases. We also observe that the accuracy of finding the non-overlapping subsequences remains almost linear and provides considerably more accurate filtering. Thus, we consider identifying only non-overlapping subsequences throughout the following sections.

The last step of MAGNET is to decide whether the mapping is potentially correct and needs to be examined by read alignment. After the termination, if the total length of all found identical subsequences is less than m-E, the two sequences are rejected. Otherwise, they are considered to be similar and the alignment can be measured using sophisticated alignment algorithms.

[Figure content: the 100 bp read and reference pair, the nine diagonal bit-vectors of the neighborhood map, and the resulting MAGNET bit-vector, with the five extracted subsequences numbered 1–5.]

Figure 7.2: An example of applying the MAGNET filtering algorithm with an edit distance threshold of 4. MAGNET finds all the longest non-overlapping subsequences of consecutive zeros in descending order of their length (as numbered in yellow).

We provide an example of applying the MAGNET filtering algorithm in Figure 7.2.

In this section, we analyze the asymptotic run time and space complexity of the MAGNET algorithm. We provide the pseudocode of MAGNET in Algorithm 7.1. We also provide the divide-and-conquer algorithm that MAGNET uses in Algorithm 7.2. MAGNET applies a divide-and-conquer technique that divides the problem of finding the identical subsequences into two subproblems in each recursion call. In the first recursion call, the extracted identical subsequence is of length at least a = ⌈(m − E)/(E + 1)⌉ bases. This reduces the problem of finding the identical subsequences from m to at most m-a, which is further divided into two subproblems: a left subproblem and a right subproblem. For the sake of simplicity, we assume that the sizes of the left and right subproblems decrease by factors of b and c, respectively, as follows:

m = a + 2 + m/b + m/c    (7.1)

The addition of 2 bases accounts for the encapsulation bits added at each recursion call.
Now, let T_MAGNET(m) be the time complexity of the MAGNET algorithm for identifying non-overlapping subsequences.

Algorithm 7.1: MAGNET
Input: Seq1, Seq2, edit distance threshold (E).
Output: 1 (similar) or 0 (dissimilar).
Pseudocode:
1:  m ← length(Seq1);
2:  // Build Neighborhood map (N)
3:  for i ← 1 to m do
4:      for j ← i-E to i+E do
5:          if Seq1[i] = Seq2[j] then
6:              N[i,j] ← 0;
7:          else
8:              N[i,j] ← 1;
9:  for i ← 1 to m do
10:     MAGNET[i] ← 1;
11: [MAGNET, calls] ← EXEN(N, 1, m, E, MAGNET, 1);
12: // Function CZ() returns the number of zeros
13: if CZ(MAGNET) ≥ m-E then return 1; else return 0;

If it takes O(km) time to find the global longest subsequence and divide the problem into two subproblems, where k = 2E+1 is the number of bit-vectors, we get the following recurrence equation:

T_MAGNET(m) = T_MAGNET(m/b) + T_MAGNET(m/c) + O(km)    (7.2)

Given that the early termination condition of the MAGNET algorithm restricts the recursion depth as follows:

Recursion tree depth = ⌈log(E + 1)⌉ − 1    (7.3)

we obtain

T_MAGNET(m) = O(km) · Σ_{x=0}^{⌈log(E+1)⌉−1} (1/b + 1/c)^x ≈ O(f · km)    (7.4)

where f is a fraction satisfying 1 ≤ f < 2. This in turn demonstrates that the MAGNET algorithm runs in linear time with respect to the sequence length and edit distance threshold, and hence it is computationally inexpensive. The space complexity of the MAGNET algorithm is as follows:

D_MAGNET(m) = D_MAGNET(m/b) + D_MAGNET(m/c) + O(km + m)
D_MAGNET(m) ≈ O(f · km + f · m)    (7.5)

Hence, the MAGNET algorithm requires linear space with respect to the read length and edit distance threshold. Next, we outline the hardware implementation details of the MAGNET filter.

In this section, we outline the challenges that are encountered in implementing the MAGNET filter in our accelerator design. Implementing the MAGNET algorithm is challenging due to the random location and variable length of each of the E+1 identical subsequences.
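In software, where variable-size indexing poses no problem, the recursive extract-and-encapsulate procedure can be sketched as follows. This is a hedged illustration: the names are ours, the input is the list of 2E+1 diagonal bit-vectors (0 = match, 1 = mismatch), and the tie-breaking rules of the real EXEN function are omitted.

```python
def magnet_filter(diagonals, m, e):
    """Divide-and-conquer sketch of MAGNET: extract up to E+1
    non-overlapping longest zero runs, encapsulating each found run
    with one penalty bit per side, and accept iff their total length
    is at least m - E."""
    state = {"total": 0, "calls": 0}

    def longest_run(lo, hi):
        """Start/end of the longest zero run in columns [lo, hi]."""
        best_len, best = 0, (0, -1)
        for bits in diagonals:
            run_start = lo
            for i in range(lo, hi + 2):
                if i > hi or bits[i] == 1:      # the run ends here
                    if i - run_start > best_len:
                        best_len, best = i - run_start, (run_start, i - 1)
                    run_start = i + 1
        return best

    def extract(lo, hi):
        if lo > hi or state["calls"] >= e + 1:  # early termination
            return
        state["calls"] += 1
        s, t = longest_run(lo, hi)
        if t < s:                               # no zeros left
            return
        state["total"] += t - s + 1
        extract(lo, s - 2)                      # left subproblem
        extract(t + 2, hi)                      # right subproblem

    extract(0, m - 1)
    return state["total"] >= m - e
```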
Verilog-2011 imposes two challenges on our architecture, as it does not support variable-size partial selection and indexing of a group of bits from a vector [139]. In particular, the first challenge lies in excluding the extracted identical subsequence along with its encapsulation bits from the search space of the next recursion call. The second challenge lies in dividing the problem into two subproblems, each of which has an unknown size at design time. To address these limitations and tackle the two design challenges, we keep the problem size fixed at each recursion call. We exclude the found longest subsequence from the search space by setting to '1' all bits of all 2E+1 bit-vectors that are located within the indices (locations) of the encapsulation bits. This ensures that the found longest subsequence and all its corresponding locations in the other bit-vectors are omitted from the following recursion calls.

Algorithm 7.2: EXEN function
Function: EXEN() extracts the top longest subsequence of consecutive zeros and generates two subproblems.
Input: Neighborhood map (N), start index (SI), end index (EI), E, MAGNET bit-vector, number of recursion calls.
Output: updated MAGNET bit-vector, updated number of calls.
Pseudocode:
1:  // Early termination condition
2:  if (SI ≤ EI and calls ≤ E+1) then
3:      // Function CCZ() returns the length and indices of the longest
4:      // subsequence of diagonally consecutive zeros
5:      for j ← 1 to E do  // Extraction
6:          [X, s1, e1] ← CCZ(N[SI+j, SI], EI);  // Lower diagonal
7:          [Y, s2, e2] ← CCZ(N[SI, SI+j], EI);  // Upper diagonal
8:          if X > Y then
9:              s ← s1; e ← e1;
10:         else
11:             s ← s2; e ← e2;
12:     [X, s1, e1] ← CCZ(N[SI, SI], EI);  // Main diagonal
13:     if X > (e-s+1) then
14:         s ← s1; e ← e1;
15:     // Early termination condition (only in the first call)
16:     if (calls = 1 and (e-s+1) < ⌈(m − E)/(E + 1)⌉) then
17:         return [MAGNET, 0];
18:     // Left subproblem with encapsulation
19:     [MAGNET, calls] ← EXEN(N, SI, s-2, E, MAGNET, calls+1);
20:     // Right subproblem with encapsulation
21:     [MAGNET, calls] ← EXEN(N, e+2, EI, E, MAGNET, calls+1);
22:     return [MAGNET, calls];
23: else return [MAGNET, calls-1];

We introduce MAGNET, a new filtering strategy that remarkably improves the accuracy of pre-alignment filtering and produces a minimal number of falsely accepted mappings. MAGNET gets rid of the first three causes of filtering inaccuracy that we observed in SHD [65] (see Chapter 3). We believe that MAGNET is the most accurate pre-alignment filter in the literature today, but this comes at the expense of hardware implementation challenges.

Chapter 8
SneakySnake: Fast, Accurate, and Cost-Effective Filter

In this chapter, we address the issue of the long execution time of read alignment using a different approach. The use of specialized hardware chips can yield significant performance improvements. However, the main drawbacks are the higher design effort, the limited data throughput, and the necessity of algorithm redesign and tailored implementation to fully leverage the FPGA architecture. To tackle these challenges and obviate these difficulties, we introduce a fast, accurate, and yet cost-effective pre-alignment filter. We call it SneakySnake.
SneakySnake leverages today's general-purpose processors (i.e., CPUs), which are largely available to bioinformaticians at no additional cost. Besides the CPU implementation of SneakySnake, we also provide an efficient hardware architecture to further accelerate the computations of the SneakySnake algorithm with high parallelism.

The key idea of the SneakySnake filter is that it only needs to know whether two sequences are dissimilar by more than the edit distance threshold. This fact leads to the following question: Can one approximate the edit distance between two sequences much faster than calculating the edit distance? Given a reference sequence A[1...m], a read sequence B[1...m], and an edit distance threshold E, SneakySnake calculates the approximate edit distance between A and B and then checks whether it is greater than the edit distance threshold; if so, the two sequences A and B are rejected. Otherwise, the two sequences A and B are accepted by SneakySnake and the optimal alignment is calculated. The approximate edit distance calculated by SneakySnake should always be less than or equal to the edit distance threshold as long as the actual edit distance does not exceed the edit distance threshold.

We convert the approximate edit distance problem to a restricted optimal path finding problem in a grid graph. We call this the Sneaky Snake Problem. It can be summarized as a traversal problem in a special grid graph, where a snake travels through the grid graph in the presence of randomly distributed obstacles. The goal is to find the path that connects the origin and the destination points with the minimal number of obstacles along the path. We describe the grid graph of the Sneaky Snake Problem as follows:

• There is a two-dimensional m-by-m grid graph, where m is the read length. The grid cells are aligned in rows and columns, and all cells have the same cost. The snake wants to travel through the grid from the top left corner towards the bottom right corner of the grid.
There are E+1 entrances next to the cells of the first column and, similarly, E+1 exits located next to the cells of the last column.

• The snake is allowed to travel through dedicated pipes. Each pipe is represented as a stretch of diagonally consecutive pairwise matches. We define the path in this problem as a sequence of diagonal grid cells. In other words, movement through the vertical cells is not counted towards the length of the path. The length of the path is simply the total number of diagonal cells along the path.

• The snake is only able to switch between pipes after skipping an obstacle (i.e., the end of a pipe). The obstacle represents a pairwise mismatch. We model the obstacle as a solid black grid cell. Traveling across an obstacle requires the snake to diagonally move one step ahead, and hence the obstacle cell is counted towards the total length of the path. The snake is allowed to avoid only E obstacles by transferring to another pipe or jumping over the obstacle within the same diagonal vector. This can be seen as the number of E tries required to arrive at the destination.

The general goal of this problem is to find a path in the grid graph with the minimum number of obstacles. The path starts at the first leftmost cell and travels out of the grid while exiting onto a destination. The SneakySnake filtering algorithm can approximate the edit distance using two different approaches. While one of them relies on what we call the weighted neighborhood maze, the other approach relies on the unweighted neighborhood maze. Both approaches involve three main steps.

Next, we describe the three steps of the first approach. (1) It constructs the weighted neighborhood maze. (2) It identifies the optimal travel path that has the least number of obstacles. (3) The mapping is accepted if and only if the snake survives the grid maze.
8.3.1 Method 1: Building the Weighted Neighborhood Maze

The first step in the SneakySnake algorithm is to build the m-by-m weighted neighborhood maze (WN). It is a modified version of our original neighborhood maze (see Chapter 6), where we change the meaning of the content of each cell. First, we change the value of a grid cell that represents a pairwise match from zero to the total number of consecutive zeros among its lower right neighbors within the same diagonal vector. Second, we change the value of a grid cell that represents a pairwise mismatch from one to zero (modeled as a solid black cell). Given i and j (where 1 ≤ i ≤ m and i-E ≤ j ≤ i+E), the entry WN[i, j] of the weighted neighborhood maze can be calculated as follows:

WN[i, j] = WN[i+1, j+1] + 1   if A[i] = B[j]
WN[i, j] = 0                  if A[i] ≠ B[j]    (8.1)

Computing each cell depends on its immediate lower right cell. This restricts the order of the computations to be performed starting from the lower right corner towards the upper left corner. We present in Figure 8.1 an example of a weighted neighborhood maze for two sequences, where sequence B differs from sequence A by three edits (i.e., obstacles).

The second step of the SneakySnake algorithm is to decide on which pipe the snake should travel, using three main steps. (1) The snake essentially uses the precomputed weight of each cell in the weighted neighborhood maze to make the decision. This is the key reason behind filling the weighted neighborhood maze from the lower right corner to the upper left corner, while the snake travels from the upper left corner towards the lower right corner.

[Figure content: the 12×12 maze grid with entrance and exit columns.]

Figure 8.1: Weighted neighborhood maze (WN) for reference sequence A = GGTGGAGAGATC and read sequence B = GGTGAGAGTTGT with E = 3.
The snake traverses the path that is highlighted in yellow, which includes three obstacles and three pipes, representing three identical subsequences: GGTG, AGAG, and T.

The snake examines the first cell of each diagonal vector in the first column of the grid and finds the cell with the largest weight. (2) If the snake is unable to decide which pipe to follow (i.e., all cells of the first column are obstacles), the snake skips the current column and consumes one attempt out of the E tries. The snake then finds the maximum weight of the cells in the next column. If it finds more than one pipe with the same length, it always chooses the first one (starting from the upper diagonal). (3) Once found, it travels through the pipe until it faces an obstacle. It repeats these three steps until it reaches its destination or consumes all the E tries.

The last step is to check whether the snake makes it through the grid maze. Given a limited number of tries (equal to E), if the snake arrives at the destination, then there exists a path that has at most E obstacles. In the context of approximate edit distance calculation, this means that there exists an alignment that has at most E edits. The SneakySnake algorithm considers this mapping a correct one and accepts it. Otherwise, it rejects the mapping without performing the read alignment step.

Here, we explain the three steps of solving the Sneaky Snake Problem using the unweighted neighborhood maze approach. (1) It first constructs the unweighted neighborhood maze. (2) It then identifies the optimal travel path that has the least number of obstacles. (3) Finally, the mapping is accepted if and only if the snake survives the grid maze without exceeding the limited number of tries.

In the unweighted neighborhood maze, we introduce two changes to the way we build the neighborhood maze.
First, each grid cell has no weight assigned to it. Second, we define a state for each grid cell, where a cell can be either in an available state or a blocked state. We set a grid cell to the available state if it represents a pairwise match. If the grid cell represents a pairwise mismatch, then we set its state to blocked. In this way, the unweighted neighborhood maze eliminates the data dependency between the grid cells that exists in the weighted neighborhood maze. Given i and j (where 1 ≤ i ≤ m and i-E ≤ j ≤ i+E), the state of the entry UN[i, j] of the unweighted neighborhood maze can be chosen as follows:

UN[i, j] = Available   if A[i] = B[j]
UN[i, j] = Blocked     if A[i] ≠ B[j]    (8.2)

[Figure content: the 12×12 maze grid with entrance and exit columns and the blocked cells shown in black.]

Figure 8.2: Unweighted neighborhood maze (UN) for reference sequence A = GGTGGAGAGATC and read sequence B = GGTGAGAGTTGT with E = 3. The snake traverses the path that is highlighted in yellow, which includes three obstacles and three pipes, representing three identical subsequences: GGTG, AGAG, and T.

We present in Figure 8.2 an example of an unweighted neighborhood maze for two sequences, where sequence B differs from sequence A by three edits. The state of the entry UN[i, j] is set to available if the i-th character of the read sequence matches the j-th character of the reference sequence. Otherwise, it is set to blocked (highlighted in black).

The second step of the SneakySnake algorithm is to decide which pipe the snake should consider, in three main steps. (1) As the snake has no prior information on the length of each pipe (unlike in the weighted neighborhood maze), it uses its telescoping lens to perform a depth-first search (DFS) [140] each time it faces a blocked cell (i.e., an obstacle). The DFS algorithm traverses the first upper diagonal vector to reach the maximum possible depth (i.e., until it arrives at a blocked cell or the end of the vector).
If the DFS algorithm reaches the exit of the maze, then SneakySnake terminates the DFS search. The depth of each grid cell is equal to the value of its j index. The algorithm stores the maximum possible depth and backtracks to the starting point (i.e., the obstacle cell). It then continues traversing the next unsearched diagonal vector and repeats the previous steps. (2) The snake selects the path with the largest depth. If it is unable to find one (i.e., all cells of the first column are obstacles), it skips the current column and consumes one path transfer out of the E allowed transfers. (3) If it finds multiple equal-length pipes, it always chooses the first one (starting from the upper diagonal). Once found, it travels through the pipe until it faces another obstacle. It repeats these three steps until it reaches its destination or consumes all the E tries.

The last step is to accept the mapping based on whether the snake arrives at its destination given a limited number of path transfers (equal to E). If it does, there is an alignment of the two given sequences that has at most E edits. Otherwise, the SneakySnake algorithm rejects the mapping without performing the read alignment step.

In this section, we analyze the asymptotic run time and space complexity of the SneakySnake algorithm. The SneakySnake algorithm builds the weighted neighborhood maze by traversing each vector starting from the lower right corner. It compares the corresponding characters of the two given sequences, and if they match, it updates the current cell with one plus the value of the previous cell. Assuming it takes O(m) time to build one diagonal vector, where m is the read length, it takes O(km) time to build the entire weighted maze, where k = 2E+1 is the number of grid vectors. Upon arriving at an obstacle, the SneakySnake algorithm examines the weight of each cell following the obstacle cell and picks the cell with the largest weight as a new starting point.
With E obstacles, this step is repeated at most E times, and each repetition takes O(k) time to compare the weights of k cells. Thus, the upper bound on the time complexity of SneakySnake using the weighted neighborhood maze is:

    T_SneakySnake^Weighted(m) = O(km + hk),        (8.3)

where h is a number satisfying 0 ≤ h ≤ E.

Using the unweighted neighborhood maze, the SneakySnake algorithm does not necessarily traverse all the cells of each diagonal vector (as in the weighted approach described above). Thus, its asymptotic run time is not fixed. On the one hand, the lower bound on its time complexity is O(m), which is achieved when the DFS algorithm reaches the exit of the maze without facing any obstacle along the path. On the other hand, the loose upper bound on its run time complexity is also O(km + hk), reached when the DFS traverses nearly the entire unweighted neighborhood maze. However, traversing the entire maze is unrealistic, as it would require each and every cell of the maze to equal '0' (for example, when all characters of the two sequences are 'A's); in that case, the DFS algorithm traverses only the first upper diagonal and then terminates. Thus, the run time complexity of SneakySnake with the unweighted neighborhood maze is:

    O(m) ≤ T_SneakySnake^Unweighted(m) < O(km + hk),        (8.4)

where h is a number satisfying 0 ≤ h ≤ E. This demonstrates that the unweighted neighborhood maze is asymptotically inexpensive compared to the weighted SneakySnake algorithm. Hence, we consider the unweighted SneakySnake algorithm for further analysis and evaluation. We provide the pseudocode of SneakySnake in Algorithm 8.1.
While the weighted SneakySnake approach requires no additional auxiliary space, the unweighted approach requires only storing the depth of at most 2E+1 vectors.

Algorithm 8.1: SneakySnake
Input: Seq1, Seq2, edit distance threshold (E).
Output: Accept (1) or Reject (0).
 1: m ← length(Seq1)
 2: // Build the unweighted neighborhood maze (UN)
 3: for i ← 1 to m do
 4:     for j ← i−E to i+E do
 5:         if Seq1[i] == Seq2[j] then UN[i, j] ← 0
 6:         else UN[i, j] ← 1
 7: // Find the path with the least number of obstacles
 8: i ← 1; e ← 0
 9: while i ≤ m and e ≤ E do
10:     LargestDepth ← 0
11:     for j ← 1 to 2E+1 do            // UN_Diagonal[diagonal id][cell index]
12:         Depth ← 0; ii ← 0
13:         while i+ii ≤ m and UN_Diagonal[j][i+ii] == 0 do
14:             Depth ← Depth + 1; ii ← ii + 1
15:         if Depth > LargestDepth then LargestDepth ← Depth
16:     i ← i + LargestDepth + 1
17:     e ← e + 1
18: // Examine the snake arrival
19: if i > m then return Accept
20: else return Reject

We introduce an FPGA-friendly architecture for the unweighted SneakySnake algorithm. Achieving an efficient hardware architecture raises the following question: Can one solve many small sub-problems of the Sneaky Snake Problem with high parallelism by reducing the search space of the SneakySnake algorithm? To this end, we divide the unweighted neighborhood maze column-wise into non-overlapping sub-matrices of t columns each (the height is always fixed to 2E+1 rows), for some parameter t. We then apply the SneakySnake algorithm to each sub-matrix independently of the other sub-matrices. This yields three key benefits. First, downsizing the search space into a reasonably small sub-matrix with dimensions known at design time limits the number of possible solutions for that sub-matrix. This reduces the size of the look-up tables (LUTs) required to build the architecture and simplifies the overall design. Second, this approach helps to maintain a modular and scalable architecture that can be implemented for any read length and edit distance threshold. Third, all the smaller sub-problems can be solved independently and rapidly with high parallelism. This reduces the execution time of the overall algorithm, as the SneakySnake algorithm does not need to evaluate the entire path.
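Algorithm 8.1 can be turned into a short runnable sketch. The Python version below is an illustration rather than the thesis's optimized implementation: it uses 0-based indexing, encodes the available/blocked test of Equation 8.2 inline in `blocked()`, treats out-of-range diagonal cells as obstacles, and counts one edit per obstacle crossing.

```python
def sneaky_snake(read, ref, E):
    """Decide the Sneaky Snake Problem on the unweighted neighborhood
    maze: greedily take the longest run of matches (pipe) over the
    2E+1 diagonals, crossing one obstacle (edit) per iteration."""
    m = len(read)

    def blocked(i, j):
        # Equation 8.2: blocked on mismatch or when j leaves the reference.
        return not (0 <= j < len(ref)) or read[i] != ref[j]

    i, edits = 0, 0
    while edits <= E:
        largest = 0
        for offset in range(-E, E + 1):          # the 2E+1 diagonals
            depth = 0
            while i + depth < m and not blocked(i + depth, i + depth + offset):
                depth += 1
            largest = max(largest, depth)
        if i + largest >= m:                     # the pipe reaches the exit
            return True
        i += largest + 1                         # jump over one obstacle
        edits += 1
    return False                                 # ran out of allowed edits

# Figure 8.2 example: the snake needs exactly three obstacle crossings,
# so the pair is accepted for E = 3 but rejected for E = 2.
assert sneaky_snake("GGTGAGAGTTGT", "GGTGGAGAGATC", 3) is True
assert sneaky_snake("GGTGAGAGTTGT", "GGTGGAGAGATC", 2) is False
```

On the Figure 8.2 pair, the greedy traversal finds the pipes GGTG, AGAG, and T, separated by three obstacles, matching the figure.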
However, these benefits come at the cost of accuracy degradation. The solution for each sub-problem is not necessarily part of the solution for the main problem (with the original size of m by m). Though our hardware architecture is guaranteed to always choose the path with the fewest obstacles within its search space, it can underestimate the number of obstacles found and thereby increase the false accept rate.

Next, we present the details of our hardware architecture. We choose the parameter t to be 8 columns. This results in partitioning the unweighted neighborhood maze of size 100 by 100 (or 2E+1 by 100 computed cells) into 13 sub-matrices, each of size 2E+1 by 8. Each row in the sub-matrix is part of a diagonal vector of the unweighted neighborhood maze. Each sub-matrix represents an individual Sneaky Snake Problem. Solving each problem requires determining the optimal path (with the least number of obstacles) along each sub-matrix using five main steps. (1) The first step is to perform the DFS for finding the optimal path along the 2E+1 rows of the sub-matrix. We implement the DFS algorithm on FPGA as a leading-zero counter (LZC). We use the LZC design proposed in [141]. It counts the number of consecutive zeros that appear in an n-bit input word before the first more significant bit that is equal to one. It generates two output signals, the log2(n) bits of the leading-zero count C and a flag Valid, for an input word A = A_{n−1}, A_{n−2}, ..., A_0, where A_{n−1} is the most significant bit.

Figure 8.3: Proposed 4-bit LZC comparator.
When all input bits are set to zero, the Valid flag is set to one. For all other inputs, the value of C represents the number of leading zeros. For the case of an 8-bit input operand, the output signals of the LZC are given by the following, where NOT(·) denotes the Boolean complement, + denotes OR, and · denotes AND:

    C2 = NOT(A0 + A1 + A2 + A3)
    C1 = NOT(A0 + A1) · (A2 + A3 + NOT(A2 + A3 + A4 + A5) · (A6 + A7))
    C0 = NOT(A0) · (A1 + NOT(A2) · (A3 + NOT(A4) · (A5 + NOT(A6) · A7)))
    Valid = NOT(A0 + A1 + A2 + A3 + A4 + A5 + A6 + A7)        (8.5)

(2) The second step is to find the row that has the largest number of leading zeros. We build a hierarchical comparator structure with log2(2E+1) levels. We use a single LZC comparator for comparing the numbers of leading zeros of two rows. The LZC comparator compares the two leading-zero counts produced by two LZC circuits and passes the larger one as its output signals without changing their values (maintaining the same meaning). We provide in Figure 8.3 the proposed architecture of the 4-bit LZC comparator. For an edit distance threshold of 5 bp, we need 12 LZC comparators arranged into 4 levels (6, 3, 2, and 1 LZC comparators per level, respectively). We provide the overall architecture of the 4-level LZC comparator tree, including the 11 LZC block diagrams, in Figure 8.4.

Figure 8.4: Block diagram of the 11 LZCs and the hierarchical LZC comparator tree for computing the largest number of leading zeros in 11 rows.

(3) The third step is to store the largest number of leading zeros and shift the bits of all rows to the right. As we implement the DFS as an LZC circuit, we always need to start the DFS search from the least significant bit of each row. The shift operation is necessary for deleting the leading zeros and allowing the snake to continue finding the optimal path. We shift each row of the sub-matrix by x+1 bits, where x is the largest number of leading zeros found in step 2.
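The LZC behavior and the shift-by-x+1 step can be checked with a small Python sketch. The gate-level forms of C2, C1, C0, and Valid below are a reconstruction (the complement bars of Equation 8.5 did not survive typesetting), so the sketch verifies them exhaustively against a plain zero-counting loop; the count scans from the least significant bit A0, as the shift step in the text implies.

```python
def lzc_reference(a):
    """Count consecutive zeros from the LSB up to the first one.
    Valid is raised only when the whole 8-bit word is zero."""
    if a == 0:
        return None, 1
    count = 0
    while (a >> count) & 1 == 0:
        count += 1
    return count, 0

def lzc_equations(a):
    """Reconstructed gate-level form of Equation 8.5 for an 8-bit word."""
    A = [(a >> i) & 1 for i in range(8)]
    NOT = lambda x: 1 - x
    c2 = NOT(A[0] | A[1] | A[2] | A[3])
    c1 = NOT(A[0] | A[1]) & (A[2] | A[3]
         | (NOT(A[2] | A[3] | A[4] | A[5]) & (A[6] | A[7])))
    c0 = NOT(A[0]) & (A[1] | (NOT(A[2]) & (A[3]
         | (NOT(A[4]) & (A[5] | (NOT(A[6]) & A[7]))))))
    valid = NOT(A[0] | A[1] | A[2] | A[3] | A[4] | A[5] | A[6] | A[7])
    return 4 * c2 + 2 * c1 + c0, valid

# The equations agree with the reference counter on all 256 inputs.
for a in range(256):
    count, valid = lzc_reference(a)
    c, v = lzc_equations(a)
    assert v == valid and (valid or c == count)

def consume(row, x, t=8):
    """Step 3: shift right by x+1 bits, filling the vacated MSBs with
    ones, which removes the found pipe and the obstacle that ends it."""
    mask = (1 << t) - 1
    return ((row >> (x + 1)) | (mask << (t - x - 1))) & mask

# Two rounds on row 00010000: x = 4, then x = 3, row fully consumed.
row = 0b00010000
x1, _ = lzc_equations(row)            # 4 leading zeros
row = consume(row, x1)                # -> 0b11111000
x2, _ = lzc_equations(row)            # 3 leading zeros
assert (x1, x2, consume(row, x2)) == (4, 3, 0b11111111)
```

The exhaustive check is the point of the sketch: if the reconstructed Boolean forms disagreed with the zero-counting semantics anywhere, one of the 256 assertions would fail.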
This guarantees that the found path is excluded from the next search round, together with the edit that divides the optimal path into shorter paths.

(4) The fourth step is to repeat the previous three steps in order to find the optimal path from the least significant bit all the way to the most significant bit. The number of replications needed depends on the desired accuracy of the SneakySnake algorithm. If our target is to find at most a single edit within each sub-matrix, then we need to build two replications of the steps described above. For example, let A be 00010000, where t = 8. The first replication computes the value of x as four zeros and updates the bits of A to 11111000. The second replication computes the value of x as three zeros and updates the bits of A to 11111111.

(5) The last step is to calculate the total number of obstacles along the entire optimal path from the least significant bit towards the most significant bit. We first find the number of replications that produce at least a single leading zero (i.e., x > 0). If it equals the total number of replications (y), then the number of obstacles (edits) equals the total number of replications included in the design. Otherwise, we compute the number of obstacles as follows:

    min( y, t − Σ_{o=1..y} x_o ),        (8.6)

where y is the total number of replications involved in the design and x_o is the largest number of leading zeros produced by the replication of index o.

8.7 Discussion

We introduce the SneakySnake algorithm and hardware architecture in order to address the question that we raise earlier in this chapter about whether approximating the edit distance calculation can be faster than calculating the exact edit distance. The accuracy of the edit distance approximation is another concern. Our SneakySnake filter always underestimates the total number of edits. This is mainly due to the goal of the Sneaky Snake Problem, which is the search for the optimal path with the least number of obstacles (or edits).
This concern raises two key questions. (1) Can one improve the accuracy of the edit distance approximation such that we achieve either a new optimal read aligner or a highly-accurate pre-alignment filter? The SneakySnake algorithm can be slightly modified such that it applies a penalty to the selected path. If the snake selects the first upper diagonal, then the obstacle is in fact a deleted character at the i-th index of the read sequence. Similarly, if the current path is on the main diagonal and the next selected path is on the second upper diagonal, then the obstacle corresponds to two deleted characters. The current implementation of SneakySnake considers these two deleted characters as a single edit. This accuracy improvement of SneakySnake is yet to be explored in this thesis.

The second question is: (2) Similarly to the way we partition the search space of the SneakySnake algorithm, can one reduce the search space of exact edit distance algorithms such that only the necessary cells are computed? We observe that exact edit distance algorithms explore a large area of the dynamic programming matrix (even with a banded matrix), which is unnecessary for highly dissimilar sequences. The edit distance is considered a non-additive distance measure [30]. This means that its calculation cannot be distributed over concatenated subsequences of a long sequence. In other words, we cannot divide the read-reference pair into shorter pairs, calculate the exact edit distance for each short read-reference pair individually (aiming at computing them concurrently), and then accumulate the results. For example, take A = "CGCG" and B = "GCGC", and observe that calculating the edit distance for each character of A with its corresponding character of B yields four edits, while the edit distance between A and B, each as a whole, is only 2 edits. This is clear from the way the dynamic programming matrix is computed.
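The CGCG/GCGC example can be checked with a textbook Levenshtein implementation (a generic sketch, not the thesis's code): summing per-chunk distances overestimates the true edit distance, which is why the calculation cannot simply be split over subsequences and accumulated.

```python
def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]

A, B = "CGCG", "GCGC"
# Character-by-character (chunk size 1): every position mismatches.
chunked = sum(edit_distance(x, y) for x, y in zip(A, B))
whole = edit_distance(A, B)
assert chunked == 4 and whole == 2    # non-additivity: 4 != 2
```

The whole-sequence distance is 2 because one deletion at the front of A and one insertion at its end transform CGCG into GCGC, an alignment the chunk-wise computation can never see.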
The computations always depend on the prefixes of both the read and the reference sequences. As a workaround, we can compute the exact edit distance in an incremental fashion. One can divide the dynamic programming matrix into sub-matrices, where each smaller sub-matrix is fully covered by all the larger sub-matrices, such that each sub-matrix UN_s strictly starts from the index i of value 1 up to some s, where 1 ≤ i ≤ s ≤ m and i−E ≤ j ≤ i+E hold. We refer to this edit distance measure as a prefix edit distance. In the next chapter, we evaluate all these scenarios in detail.

In this chapter, we introduce the Sneaky Snake Problem, and we show how an approximate edit distance problem can be converted to an instance of the Sneaky Snake Problem. Subsequently, we propose a new pre-alignment filtering algorithm (we call it SneakySnake) that obviates the need for expensive specialized hardware. The solution we provide is cost-effective in a limited-resources environment. Our algorithm does not exploit any SIMD-enabled CPU instructions or vendor-specific processors, which makes it superior and attractive. We also provide an efficient and scalable hardware architecture along with several design optimizations for the SneakySnake algorithm. Finally, we discuss several optimizations and challenges of accelerating both approximate and exact edit distance calculations.

Chapter 9

Evaluation

In this chapter, we evaluate the FPGA resource utilization, the filtering accuracy, and the memory utilization of all our proposed pre-alignment filters. We also investigate the benefits of using our hardware and CPU-based pre-alignment filtering solutions along with state-of-the-art aligners. We compare the performance of our proposed pre-alignment filters (SneakySnake, MAGNET, Shouji, and GateKeeper) with the state-of-the-art existing pre-alignment filter, SHD [65], and read aligners, Edlib [78], Parasail [54], GSWABE [57], CUDASW++ 3.0 [59], and FPGASW [53].
We run all experiments on a 3.6 GHz Intel i7-3820 CPU with 8 GB RAM. We use a Xilinx Virtex 7 VC709 board [133] to implement our accelerator architecture and our hardware filters. We build the FPGA designs using Vivado 2015.4 in synthesizable Verilog.

Our experimental evaluation uses 12 different real datasets. Each dataset contains 30 million real sequence pairs. Next, we elaborate on how we obtain them.

Table 9.1: Benchmark Illumina-like read sets of the whole human genome, obtained from EMBL-ENA.

    Accession no.       ERR240727_1            SRR826460_1            SRR826471_1
    Read length (bp)    100                    150                    250
    HTS                 Illumina HiSeq 2000    Illumina HiSeq 2000    Illumina HiSeq 2000

Table 9.2: Benchmark Illumina-like datasets (read-reference pairs). We map each read set, described in Table 9.1, to the human reference genome in order to generate four datasets per read set using different edit distance thresholds of the mapper (the -e parameter of mrFAST).

We use different values of the -e parameter of mrFAST to generate four real datasets from each read set. We summarize the details of these 12 datasets in Table 9.2. For convenience of reference, we number them from 1 to 12 (e.g., set 1 represents 30 million reads from ERR240727_1 mapped with -e = 2 edits). The 12 real datasets enable us to measure the effectiveness of the filters in tolerating both a low number of edits and far more edits than the allowed edit distance threshold. We provide detailed information on the number of correct and incorrect pairs in each of the 12 datasets for different user-defined edit distance thresholds in Table A.1, Table A.2, and Table A.3 in Appendix A.

9.2 Resource Analysis

We now examine the FPGA resource utilization of the hardware implementations of the GateKeeper, Shouji, MAGNET, and SneakySnake pre-alignment filters.
We provide several hardware designs for two commonly used edit distance thresholds, 2 bp and 5 bp, as reported in [18, 65, 122, 62], for a sequence length of 100 bp. The VC709 FPGA chip contains 433,200 slice LUTs (look-up tables) and 866,400 slice registers (flip-flops). Table 9.3 lists the FPGA resource utilization for a single filtering unit. We make five main observations.

1. The design for a single MAGNET filtering unit requires about 10.5% and 37.8% of the available LUTs for edit distance thresholds of 2 bp and 5 bp, respectively. Hence, MAGNET can process 8 and 2 sequence pairs concurrently for edit distance thresholds of 2 bp and 5 bp, respectively, without violating the timing constraints of our hardware accelerator.

2. The design for a single Shouji filtering unit requires about 15x-21.9x fewer LUTs compared to MAGNET. This enables Shouji to achieve more parallelism than MAGNET, as it can have 16 filtering units within the same FPGA chip.

3. GateKeeper requires about 26.9x-53x and 1.7x-2.4x fewer LUTs compared to MAGNET and Shouji, respectively. GateKeeper can also examine up to 16 sequence pairs at the same time.

4. SneakySnake requires 15.4x-26.6x fewer LUTs compared to MAGNET. While SneakySnake requires slightly fewer LUTs compared to Shouji, it requires about 2x more LUTs compared to GateKeeper. SneakySnake can also examine up to 16 sequence pairs concurrently.

5. We observe that the hardware implementations of Shouji, MAGNET, and SneakySnake require pipelining the design (i.e., shortening the critical path delay of each processing core by dividing it into stages or smaller tasks) to meet the timing constraints and achieve more parallelism. We build 8 pipeline stages for Shouji, 22 pipeline stages for MAGNET, and 5 pipeline stages for SneakySnake to satisfy the timing constraints. However, pipelining the design comes at the expense of increased register utilization.

Table 9.3: FPGA resource usage (slice LUTs, slice registers, and number of filtering units) for a single filtering unit of Shouji, MAGNET, GateKeeper, and SneakySnake, for a sequence length of 100 bp and under different edit distance thresholds (E).

We conclude that the FPGA resource usage is correlated with the filtering accuracy. For example, the least accurate filter, GateKeeper, occupies the fewest FPGA resources. We also conclude that the lower the logic utilization of a single filtering unit, the larger the number of filtering units that can be integrated into the FPGA.

Next, we assess the false accept rate and false reject rate of GateKeeper, Shouji, MAGNET, and SneakySnake across our 12 datasets. We also investigate and address several concerns that we raise in Chapter 8. We compare the accuracy of our proposed pre-alignment filters with the best performing existing pre-alignment filter, SHD [65].

SHD supports a sequence length of up to only 128 characters (due to the SIMD register size). To ensure as fair a comparison as possible, we allow SHD to divide long sequences into batches of 128 characters, examine each batch individually, and then sum up the results. As we describe in Chapter 3, we aim to minimize the false accept rate so that the elimination of dissimilar sequences is maximized. We also aim to maintain a 0% false reject rate. We use Edlib [78] (set to edit distance mode) to generate the ground-truth edit distance value for each sequence pair, as it has a zero false accept rate and a zero false reject rate.

We raise two key questions in Chapter 8: (1) Can one approximate the edit distance between two sequences much faster than calculating the exact edit distance? (2) Can one reduce the search space of approximate and exact edit distance algorithms?
To answer these two questions, we first evaluate the performance of SneakySnake (an approximate edit distance algorithm) and then examine the feasibility of reducing its search space without causing falsely-rejected mappings. Second, we examine the ability to modify the best performing existing exact edit distance algorithm, Edlib [78], such that it calculates the prefix edit distance.

As we discuss in Chapter 8, we column-wise partition the unweighted neighborhood maze of the SneakySnake algorithm into adjacent non-overlapping sub-matrices of the same size (2E+1 by t). In Figure 9.1, we illustrate the effects of this partitioning on the false accept rate and the execution time of our SneakySnake algorithm. We make two observations.

1. Partitioning the search space of SneakySnake with a partition size of 5 (t=5) reduces its execution time by up to 5.12x (Figure 9.1a), 7.4x (Figure 9.1c), and 13.2x (Figure 9.1e), at the expense of increasing the false accept rate by up to 55.4x (Figure 9.1b), 43.5x (Figure 9.1d), and 67.3x (Figure 9.1f).

2. There is a trade-off between the speed and the accuracy of the SneakySnake algorithm. For example, the least accurate configuration, SneakySnake with a partition size of 5 columns (SneakySnake-5 for short), yields the fastest speed.

Next, we assess the effect of the number of replications on the filtering accuracy of the hardware implementation of the SneakySnake algorithm. We use a sub-matrix width of 8 columns (t=8) and vary the height of the sub-matrix from 1 row (i.e., E=0 bp) up to 21 rows (i.e., E=10 bp). Based on Figure 9.2, we make two observations:

1. Increasing the number of replications in the design improves the filtering accuracy of the SneakySnake algorithm. This observation is in accord with our expectation, as each replication detects at most a single edit within each sub-matrix.
2. The hardware implementation of SneakySnake using 3 replications (3 iterations for finding the optimal path within each sub-matrix) achieves similar (or slightly better) accuracy as SneakySnake-5.

We conclude that partitioning the search space of the SneakySnake algorithm is also beneficial for building an efficient hardware architecture while maintaining high filtering accuracy.

Now, we modify the Edlib algorithm such that it applies a prefix edit distance (we provide the definition in Chapter 8). It starts by computing only a small square sub-matrix. If the computed edit distance meets the user-defined edit distance threshold, it extends the sub-matrix into a larger square one (which overlaps entirely with the smaller one) by increasing the number of columns and rows by a constant. For example, if the initial sub-matrix size is 5 columns by 5 rows, then we extend it into a larger one of size 10 columns by 10 rows, and next into a sub-matrix of size 15 columns by 15 rows.

Figure 9.1: The effects of column-wise partitioning the search space of the SneakySnake algorithm on the average false accept rate ((a), (c), and (e)) and the average execution time ((b), (d), and (f)) of examining set 1 to set 4 in (a) and (b), set 5 to set 8 in (c) and (d), and set 9 to set 12 in (e) and (f). Besides the default size (equal to the read length) of SneakySnake's unweighted neighborhood maze, we choose partition sizes (the number of the grid's columns included in each partition) of 5, 10, 25, and 50 columns.

Figure 9.2: The effects of the number of replications in the hardware implementation of the SneakySnake algorithm on its filtering accuracy (false accept rate). We use a wide range of edit distance thresholds (0 bp-10 bp for a read length of 100 bp) and four datasets: (a) set 1, (b) set 2, (c) set 3, and (d) set 4.

We keep extending the size of the sub-matrix until we cover the entire dynamic programming matrix or the prefix edit distance exceeds the edit distance threshold. In Figure 9.3, we present the effects of computing the prefix edit distance (partitioning the edit distance matrix) on the overall execution time.
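The extend-until-threshold scheme can be sketched as follows. This is an illustrative sketch, not Edlib's actual API: `edit_distance` stands in for any exact edit distance routine, and the loop recomputes growing square prefixes (5x5, 10x10, ...) until the whole matrix is covered or the prefix edit distance exceeds the threshold E.

```python
def edit_distance(a, b):
    """Plain DP Levenshtein distance (stand-in for an exact aligner)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def prefix_filtered_distance(read, ref, E, step=5):
    """Compute the edit distance on growing square prefixes; stop early
    once the prefix edit distance already exceeds the threshold E."""
    m = len(read)
    for s in range(step, m + step, step):
        s = min(s, m)
        d = edit_distance(read[:s], ref[:s])
        if d > E:
            return None          # early rejection: prefix already too distant
        if s == m:
            return d             # covered the entire matrix
```

For dissimilar pairs the first small prefix usually exceeds E, so most of the dynamic programming matrix is never computed; for similar pairs the scheme pays the extra cost of recomputing each enlarged sub-matrix.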
We also provide the performance of the original and the partitioned SneakySnake algorithms for a better comparison. We provide more detailed results in Table A.4, Table A.5, and Table A.6 in Appendix A. We make two key observations.

1. Our SneakySnake algorithm with a partition size of 5 (SneakySnake-5) is up to 25.5x (Figure 9.3a), 52.5x (Figure 9.3b), and 94.5x (Figure 9.3c) faster than the best performing edit distance algorithm, Edlib [78].

2. Prefix edit distance with a large enough initial sub-matrix size provides a slight reduction in the execution time of Edlib. We observe that setting the initial sub-matrix size to 50 bp provides the highest reduction in Edlib's execution time over a wide range of edit distance thresholds. Edlib with an initial sub-matrix size of 50 bp (Edlib-50 for short) is up to 1.9x (Figure 9.3a), 4.4x (Figure 9.3b), and 7.8x (Figure 9.3c) faster than the original Edlib. However, SneakySnake-5 is still up to an order of magnitude (19.8x) faster than Edlib-50 when the edit distance threshold is set to 9 (for m=100) or 10 (for m=150 or 250) and below, as highlighted with a dashed vertical line in Figure 9.3. Note that E is typically less than or equal to 5% of the read length [18, 65, 122, 62].

We conclude that reducing the search space of both approximate and exact edit distance algorithms is beneficial. Combining SneakySnake-5 and Edlib-50 can provide a fast examination of incorrect mappings across a wide range of edit distance thresholds (0%-10% of the read length).

We present in Figure 9.4, Figure 9.5, and Figure 9.6 the false accept rate of our pre-alignment filters compared to SHD for read lengths of 100 bp, 150 bp, and 250 bp, respectively. We make six key observations.

1. Shouji, MAGNET, GateKeeper, and SneakySnake are less accurate in examining low-edit sequences (Figures 9.4a,b, 9.5a,b, and 9.6a,b) than edit-rich sequences (Figures 9.4c,d, 9.5c,d, and 9.6c,d).
While the SneakySnake pre-alignment filter yields the highest accuracy, SHD [65] and GateKeeper provide the lowest accuracy compared to all other pre-alignment filters. The slope of the MAGNET plot is almost comparable to that of SneakySnake (with the default maze size).

Figure 9.3: The effects of computing the prefix edit distance on the overall execution time of the edit distance calculations, compared to the original Edlib (exact edit distance) and our partitioned implementation of the SneakySnake algorithm. We present the average time spent in examining set 1 to set 4 in (a), set 5 to set 8 in (b), and set 9 to set 12 in (c). We choose initial sub-matrix sizes of 5, 10, 25, and 50 columns. We mark the intersection of the SneakySnake-5 and Edlib-50 plots with a dashed vertical line.

2. GateKeeper and SHD [65] become ineffective for edit distance thresholds greater than 8% for m = 100 bp (Figure 9.4), 5% for m = 150 bp (Figure 9.5), and 3% for m = 250 bp (Figure 9.6), for both low-edit and edit-rich sequences, where m is the read length. This leads to examining each sequence pair twice unnecessarily, by both GateKeeper or SHD and the full alignment step.

3. Shouji provides up to 17.2x, 73x, and 467x lower false accept rates compared to GateKeeper and SHD for read lengths of 100 bp, 150 bp, and 250 bp, respectively.
4. MAGNET, SneakySnake (with the default maze size), and SneakySnake-5 show a slow exponential degradation in their filtering accuracy for low-edit sequences, with false accept rates of up to 50%, 30%, and 70%, respectively, as we show in Figures 9.4a,b, 9.5a,b, and 9.6a,b. They also show an almost linear growth in their false accept rate, which stays below 4%, for edit-rich sequences of different read lengths, as we show in Figures 9.4c,d, 9.5c,d, and 9.6c,d.

5. MAGNET shows up to 1577x, 3550x, and 25552x lower false accept rates compared to GateKeeper and SHD for read lengths of 100 bp, 150 bp, and 250 bp, respectively. MAGNET also provides up to 205x, 951x, and 16760x lower false accept rates compared to Shouji for read lengths of 100 bp, 150 bp, and 250 bp, respectively.
SneakySnake also shows up to 64.1x, 16x,and 22x less false accept rate compared to MAGNET for read lengths of100 bp, 150 bp, and 250 bp, respectively.We conclude that SneakySnake, MAGNET, and Shouji are very effective andsuperior to the state-of-the-art pre-alignment filter, SHD [65] in both situations100 F a l s e A cc e p t R a t e Edit Distance ThresholdSHDGateKeeperShoujiMAGNETSneakySnake-100SneakySnake-5 00.20.40.60.81 0 1 2 3 4 5 6 7 8 9 10 F a l s e A cc e p t R a t e Edit Distance ThresholdSHDGateKeeperShoujiMAGNETSneakySnake-100SneakySnake-500.20.40.60.81 0 1 2 3 4 5 6 7 8 9 10 F a l s e A cc e p t R a t e Edit Distance ThresholdSHDGateKeeperShoujiMAGNETSneakySnake-100SneakySnake-500.20.40.60.81 0 1 2 3 4 5 6 7 8 9 10 F a l s e A cc e p t R a t e Edit Distance ThresholdSHDGateKeeperShoujiMAGNETSneakySnake-100SneakySnake-5 (a) (b)(c) (d) Figure 9.4: The false accept rate produced by our pre-alignment filters, Gate-Keeper, Shouji, MAGNET, and SneakySnake, compared to the best performingfilter, SHD [65]. We use a wide range of edit distance thresholds (0-10 edits fora read length of 100 bp) and four datasets: (a) set 1, (b) set 2, (c) set 3, and (d)set 4.(low-edit and edit-rich mappings). They maintain a very low rate of falsely-accepted incorrect mappings and significantly improves the accuracy of pre-alignment filtering by up to five orders of magnitude compared to GateKeeperand SHD. 
Using our 12 low-edit and edit-rich datasets for three different read lengths, we observe that SneakySnake (for all partition sizes), Shouji, GateKeeper, and SHD do not filter out correct mappings; hence, they provide a 0% false reject rate. The reason behind this is the way we find the identical subsequences.

Figure 9.5: The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-15 edits for a read length of 150 bp) and four datasets: (a) set 5, (b) set 6, (c) set 7, and (d) set 8.

Figure 9.6: The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-25 edits for a read length of 250 bp) and four datasets: (a) set 9, (b) set 10, (c) set 11, and (d) set 12.

Figure 9.7: An example of a falsely-rejected mapping using the MAGNET algorithm for an edit distance threshold of 6. The figure shows the read, the reference, the thirteen diagonal bit-vectors, and the resulting MAGNET bit-vector. The random zeros (highlighted in red) confuse the MAGNET filter, causing it to select shorter segments of random zeros instead of a longer identical subsequence (highlighted in blue).

We aim to find the subsequences that have the largest number of zeros, such that we maximize the number of matches and minimize the number of edits that cause the division of one long identical sequence into shorter subsequences. However, this is not the case for MAGNET. We observe that MAGNET shows a very low, but nonzero, false reject rate of less than 0.00045% for an edit distance threshold of at least 4 bp. This is due in large part to the greedy choice of always selecting the longest identical subsequences in descending order of length, regardless of their source (i.e., the diagonal bit-vectors that originate them). We provide in Figure 9.7 an example of where MAGNET fails to maintain the correct mappings. A potential solution is to relate the number of encapsulated bits to the source of each segment. For instance, if any extracted segment in the MAGNET mask has a different source than its neighboring segments, then we need to penalize that segment by adding more encapsulation bits. This can also help us to produce an accurate CIGAR string for each mapping. With the help of an auxiliary data structure, we can keep track of the source of each extracted segment at each position. While matches coming from the "upper diagonal-1" indicate a single deletion, matches coming from the "upper diagonal-2" indicate two deletions. A detailed algorithm for this is beyond the scope of this thesis and is part of our future work.
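The greedy extraction step described above can be sketched in a few lines. The simplified Python model below (our own illustration, not the thesis implementation) repeatedly takes the longest run of zeros across all diagonal bit-vectors while ignoring which diagonal each run comes from; this source-blindness is what lets runs of "random zeros" on a wrong diagonal occasionally outcompete a true identical segment, as in Figure 9.7.

```python
# Simplified model of MAGNET's greedy segment extraction (illustrative only).

def longest_zero_run(bitvector, used):
    """Longest run of '0's in `bitvector` that avoids already-used columns.
    Returns (length, start)."""
    best = (0, 0)
    start = None
    for i, (b, u) in enumerate(zip(bitvector, used)):
        if b == '0' and not u:
            if start is None:
                start = i
            if i - start + 1 > best[0]:
                best = (i - start + 1, start)
        else:
            start = None
    return best

def magnet_greedy(diagonals, num_segments):
    """Extract `num_segments` segments greedily across all diagonal
    bit-vectors, ignoring each segment's source diagonal."""
    used = [False] * len(diagonals[0])
    picks = []
    for _ in range(num_segments):
        # Source-blind choice: the best run over ALL diagonals wins.
        length, start, d = max(
            (*longest_zero_run(vec, used), d) for d, vec in enumerate(diagonals)
        )
        if length == 0:
            break
        picks.append((d, start, length))
        for i in range(start, start + length):
            used[i] = True
    return picks

# Two toy diagonal bit-vectors; '0' marks a match.
print(magnet_greedy(["000100000", "111000111"], num_segments=2))
# → [(0, 4, 5), (0, 0, 3)]
```

The proposed fix sketched in the text amounts to adding a penalty inside the `max(...)` selection whenever a candidate run's diagonal differs from that of its neighboring, already-picked segments.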
9.4 Effects of Hardware Pre-Alignment Filtering on Read Alignment

We first analyze the execution time of our hardware pre-alignment filters, GateKeeper, MAGNET, Shouji, and SneakySnake. We build the FPGA implementation of SneakySnake using a sub-matrix width of 8 columns (t = 8) and include 3 replications in the design. We use GateKeeper as an optimized and efficient hardware implementation of SHD [65]. We evaluate our four pre-alignment filters using a single FPGA chip. We use 120 million sequence pairs, each of which is 100 bp long, from set 1, set 2, set 3, and set 4. We summarize the execution time of the CPU implementations along with that of their hardware accelerators in Table 9.4. We make two key observations based on Table 9.4.

1. Our hardware accelerators provide two to three orders of magnitude (322x to 7,250x) speedup over their CPU implementations. GateKeeper provides up to two orders of magnitude of acceleration over SHD.

2. The execution times of the hardware implementations of SneakySnake and Shouji are as low as that of GateKeeper and 2x-8x lower than that of the MAGNET pre-alignment filter. This observation is in accord with our expectation and can be explained by the fact that MAGNET has more computational overhead, which limits the number of filtering units. Yet SneakySnake is four and five orders of magnitude more accurate than Shouji and GateKeeper, respectively (as we show earlier).

We conclude that our hardware accelerators provide two to three orders of magnitude of speedup over their CPU implementations. Additionally, the execution time of a hardware accelerator is proportional to its FPGA resource utilization (the lower the resource utilization, the lower the execution time).

Next, we analyze the benefits of integrating our hardware pre-alignment filters with the state-of-the-art aligners.
GateKeeper, MAGNET, Shouji, and SneakySnake are standalone pre-alignment filters and can be integrated with any existing alignment algorithm.

Table 9.4: The execution time (in seconds) of GateKeeper, MAGNET, Shouji, and SneakySnake under different edit distance thresholds. We use set 1 to set 4 with a read length of 100 bp. We provide the performance results for the CPU implementations (SHD, MAGNET-CPU, Shouji-CPU, and SneakySnake-5) and the hardware accelerators (GateKeeper, MAGNET, Shouji, and SneakySnake-FPGA) with the maximum number of filtering units.

In Table 9.5, we present the effects of our four hardware pre-alignment filters on the overall alignment execution time. We use a sub-matrix width of 8 columns (t = 8) and include 3 replications in the design of the hardware architecture of the SneakySnake algorithm. We also compare the effect of our pre-alignment filters with that of SHD [65]. We select the five best performing aligners, each of which is designed for a different type of computing platform. While the Edlib [78] algorithm is implemented in C for standard CPUs, Parasail [54] exploits SIMD-capable CPUs. GSWABE [57] is designed for GPUs. CUDASW++ 3.0 [59] exploits the SIMD capability of both CPUs and GPUs. FPGASW [53] exploits the very large number of hardware execution units offered by the same FPGA chip (i.e., VC709) as our accelerator. We evaluate the execution time of Edlib [78] and Parasail [54] on our machine. However, FPGASW [53], CUDASW++ 3.0 [59], and GSWABE [57] are not open-source and not available to us. Therefore, we scale the reported number of computed entries of the dynamic programming matrix per second. We use a total of 120 million real sequence pairs from our previously described four datasets (set 1, set 2, set 3, and set 4) in this analysis. We make three key observations.

1.
The execution time of Edlib [78] is reduced by up to 21.4x, 18.8x, 16.5x, 13.9x, and 5.2x after the addition of SneakySnake, Shouji, MAGNET, GateKeeper, and SHD, respectively, as a pre-alignment filtering step. We observe nearly the same trend for Parasail [54] combined with each of the four pre-alignment filters.

2. Aligners designed for FPGAs and GPUs follow a different trend than the one we observe for the CPU aligners. We observe that the ability of SHD [65] to reduce the alignment time of GSWABE [57], CUDASW++ 3.0 [59], and FPGASW [53] diminishes. SHD even provides unsatisfactory performance, as it increases the execution time of the aligner instead of reducing it. This is due to the fact that SHD is 6x slower than CUDASW++ 3.0 [59] and FPGASW [53] and slightly slower than GSWABE [57].

3. SneakySnake, Shouji, MAGNET, and GateKeeper still contribute significantly to reducing the overall execution time of FPGA- and GPU-based aligners. SneakySnake reduces the execution time of FPGASW [53], CUDASW++ 3.0 [59], and GSWABE [57] by factors of up to 16x, 15.5x, and 20.3x, respectively. This is slightly higher (up to 1.3x) than the effect of Shouji on the execution time of these aligners. Shouji reduces the overall alignment time of FPGASW [53], CUDASW++ 3.0 [59], and GSWABE [57] by factors of up to 14.5x, 14.2x, and 17.9x, respectively. This is up to 1.5x, 1.4x, and 85x more than the effect of MAGNET, GateKeeper, and SHD, respectively, on the end-to-end alignment time.

We conclude that among the four hardware pre-alignment filters, SneakySnake (3-replication design with t = 8) is the best performing filter in terms of both speed and accuracy.

Table 9.5: End-to-end execution time (in seconds) for several state-of-the-art sequence alignment algorithms, with and without pre-alignment filters (SneakySnake, Shouji, MAGNET, GateKeeper, and SHD), across different edit distance thresholds. We use four datasets (set 1, set 2, set 3, and set 4).
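The end-to-end speedups above follow a simple cost model: total time is the filter's own time plus the aligner's time on the pairs the filter accepts, so a cheap filter with a low accept rate translates directly into a large speedup. A back-of-the-envelope sketch, where every number is an illustrative assumption rather than a measurement from Table 9.5:

```python
# Back-of-the-envelope model of end-to-end alignment time with a
# pre-alignment filter. All numbers below are illustrative assumptions.

def end_to_end_speedup(t_filter_per_pair, t_align_per_pair, accept_rate):
    """Speedup of (filter + align accepted pairs) over aligning every pair.

    accept_rate is the fraction of pairs the filter passes to the aligner
    (correct mappings plus false accepts); a lower false accept rate means
    fewer wasted alignments and thus a higher speedup."""
    baseline = t_align_per_pair
    filtered = t_filter_per_pair + accept_rate * t_align_per_pair
    return baseline / filtered

# Example: filtering is 100x cheaper than alignment and passes 5% of pairs.
print(round(end_to_end_speedup(0.01, 1.0, 0.05), 1))  # 16.7
```

This also explains observation 2: when the aligner itself is accelerated (GPU or FPGA), `t_align_per_pair` shrinks, so a slow filter such as SHD can dominate the total and even hurt end-to-end performance.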
Integrating SneakySnake with an aligner does not lead to negative effects. We also conclude that the concept of pre-alignment filtering is still effective in boosting the overall performance of the alignment step, even when the dynamic programming algorithm is accelerated by state-of-the-art hardware accelerators such as SIMD-capable CPUs, FPGAs, and GPUs.

Now we analyze the benefits of integrating our CPU implementations of SneakySnake and the prefix edit distance with the state-of-the-art CPU aligners, Edlib [78] and Parasail [54]. We evaluate the end-to-end execution time of Edlib (referred to as path in [78] and configured as a banded global Levenshtein distance with CIGAR-enabled output) with and without the pre-alignment filtering step for read lengths of 100 bp (Figure 9.8), 150 bp (Figure 9.9), and 250 bp (Figure 9.10). We make three key observations.

1. SneakySnake-5 combined with the banded Edlib is up to 20.4x, 33x, and 43x faster than Edlib without a pre-alignment filter for E < … bp (m = 100 bp), E < 12 bp (m = 150 bp), and E < 15 bp (m = 250 bp), where E is the edit distance threshold and m is the read length.

2. Edlib-50 combined with the banded Edlib is slower than SneakySnake-5 combined with the same aligner for low-edit datasets (Figure 9.8a-b, Figure 9.9a-b, and Figure 9.10a-b). Edlib-50 becomes more effective than SneakySnake-5 in reducing the execution time of Edlib across edit-rich datasets for E > … bp (m = 100 bp and 150 bp) and E > … bp (m = 250 bp).

3.
SHD leads to a negative effect, as it slows down the alignment speed of Edlib for E > … bp (m = 100 bp) and E > … bp (m = 150 bp and 250 bp).

Secondly, we evaluate the effects of adding these pre-alignment filters to Parasail [54] for read lengths of 100 bp (Figure 9.11), 150 bp (Figure 9.12), and 250 bp (Figure 9.13). We configure Parasail as NW banded and with CIGAR-enabled output. We make three key observations.

1. SneakySnake-5 still provides the highest end-to-end speedup over all other pre-alignment filters when combined with Parasail for E < … bp (m = 100 bp), E < … bp (m = 150 bp), and E < 10 bp (m = 250 bp). SneakySnake-5 combined with Parasail yields up to 36.3x, 42x, and 57.9x speedup over Parasail without a pre-alignment filter for read lengths of 100 bp, 150 bp, and 250 bp, respectively.

2. Edlib-50 combined with Parasail becomes more effective than SneakySnake-5 combined with Parasail for E > … bp (m = 100 bp), E > … bp (m = 150 bp), and E > … bp (m = 250 bp).

Figure 9.8: End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters.
We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm.

Figure 9.9: End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds.
We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm.

Figure 9.10: End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds.
We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm.

Figure 9.11: End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds.

3. SHD slows down the alignment speed of Parasail for E > … bp (m = 100 bp) and E > … bp (m = 150 bp and 250 bp).

We conclude that our SneakySnake algorithm is the best performing CPU pre-alignment filter in terms of both speed and accuracy. It accelerates the state-of-the-art read alignment algorithms by up to an order of magnitude. We demonstrate that the SneakySnake algorithm does not lead to negative effects for edit distance thresholds of 0% to 10% of the read length.
We also want to emphasize that combining our SneakySnake pre-alignment filter (i.e., SneakySnake-5) with a prefix edit distance algorithm (i.e., Edlib-50) provides the fastest and most accurate pre-alignment filtering across a wide range of edit distance thresholds and read lengths.

Figure 9.12: End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds.

In this section, we evaluate the space-efficiency benefits of integrating our SneakySnake algorithm with the state-of-the-art full read aligner, Edlib [78]. We use the Valgrind massif tool to examine the memory utilization. We provide the memory footprint of Edlib in Figure 9.14. On average, Edlib shows a memory footprint of 150 KB. We then evaluate the memory utilization of integrating Edlib (in edit distance mode and without backtracking) with Edlib (path) in Figure 9.15.
The addition of the exact edit distance algorithm to the read alignment shows a slight reduction (∼…) in memory utilization.

Figure 9.13: End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds.

This observation is in accord with our expectation and can be explained by the fact that SneakySnake-5 requires a space as small as a sub-matrix of size 5 x (2E+1) and at most m x (2E+1), whereas Edlib always requires a space of m x m. We conclude that the SneakySnake algorithm is fast, accurate, and yet memory-efficient.

Figure 9.14: Memory utilization of the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25.

Figure 9.15: Memory utilization of the exact edit distance algorithm (Edlib ED) combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25.
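The space argument above can be made concrete by counting the dynamic programming entries each approach keeps live. The helper names below are our own illustration; the formulas are the ones stated in the text.

```python
# Sketch comparing working-set sizes: SneakySnake-5 keeps a sub-matrix of
# roughly 5 x (2E+1) entries (at most m x (2E+1)), while a full
# dynamic-programming aligner keeps m x m entries.

def sneakysnake_cells(E, partition=5):
    """Entries needed by SneakySnake with a partition of 5 rows."""
    return partition * (2 * E + 1)

def full_dp_cells(m):
    """Entries needed by a full m x m dynamic programming matrix."""
    return m * m

# Parameters of set 12: read length 250 bp, edit distance threshold 25.
E, m = 25, 250
print(sneakysnake_cells(E))  # 255
print(full_dp_cells(m))      # 62500
```

Even in this worst case for SneakySnake-5 (the largest threshold we evaluate), the working set is more than two orders of magnitude smaller than the full matrix, which is consistent with the memory utilization gap shown in Figures 9.14-9.16.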
Figure 9.16: Memory utilization of SneakySnake-5 combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25.

Chapter 10

Conclusions and Future Directions

Our goal in this thesis is to considerably minimize the time spent on calculating the optimal alignment in genome analysis, given limited computational resources (i.e., a personal computer or small hardware). To this end, we first provide a comprehensive accuracy analysis of pre-alignment filtering. Understanding the causes of filtering inaccuracy helps us to design new fast and accurate pre-alignment filters. Second, we propose the first hardware accelerator architecture for pre-alignment in genome read mapping. We leverage the large number of filtering units that our hardware accelerator offers for accelerating our proposed hardware-aware algorithms. We propose four hardware pre-alignment filters: GateKeeper, Shouji, MAGNET, and SneakySnake. In our experimental evaluation, our hardware pre-alignment filters show, on average, three orders of magnitude speedup over their equivalent CPU implementations. We demonstrate that GateKeeper occupies the smallest share of the FPGA resources but is the least accurate filter. We show that MAGNET provides a low false accept rate but incurs a very low rate of falsely rejected mappings. We also demonstrate that Shouji is more accurate than GateKeeper and faster than MAGNET. However, SneakySnake is our best performing pre-alignment filter in terms of both speed and accuracy. SneakySnake has a very low false accept rate and a 0% false reject rate. We demonstrate that SneakySnake reduces the execution time of existing read aligners by up to an order of magnitude.

Third, we introduce a fast and cost-effective CPU implementation of our best performing pre-alignment filter, SneakySnake.
In our comprehensive evaluation, we demonstrate that the CPU implementation of SneakySnake reduces the execution time of the best performing read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. We also experimentally demonstrate that SneakySnake has a 50% smaller memory footprint than Edlib. The CPU implementation of SneakySnake obviates the need for costly hardware and high hardware design effort by providing a fast and cost-effective implementation.

We demonstrate that the concept of pre-alignment filtering provides substantial benefits to existing and future read alignment algorithms. New accelerated sequence aligners are frequently introduced that offer different strengths and features. Our proposed pre-alignment filters offer the ability to accelerate existing aligners by an order of magnitude without sacrificing any of their capabilities and features. As such, we hope that this thesis catalyzes the adoption of our proposed pre-alignment filters in genome sequence analysis, as they are becoming increasingly necessary to cope with the processing requirements of greatly increasing amounts of genomic data.

This thesis opens up several avenues for future research. In this section, we describe five directions based on the ideas and approaches proposed in this thesis. These ideas can lead to a new read mapper or improve existing ones, which we will explore for various mappers in our future research.

1. The first potential target of our research is to influence the design of more intelligent and attractive sequencing machines by integrating SneakySnake or Shouji inside them, to perform real-time pre-alignment filtering. Sequencing machines (e.g., Illumina HiSeq 2500, HiSeq 4000, and MiSeq) are equipped with FPGA chips for accelerating their internal computations. Integrating our pre-alignment filters with the sequencing machine has two benefits.
First, it allows a significant reduction in the total execution time of genome analysis by starting read mapping while sequencing is still in progress [144]. Second, it can hide the complexity and details of the underlying hardware from users who are not necessarily fluent in FPGAs.

2. Cloud-enabled pre-alignment filtering in a pay-per-use fashion is also a promising solution to encourage the use of such pre-alignment filters and address the concern of lack of experience in dealing with hardware designs. Cloud computing offers access to a large number of advanced FPGA chips that can be used concurrently via a simple user-friendly interface. However, such a scenario requires the development of privacy-preserving pre-alignment filters due to privacy and legal concerns related to sharing sensitive genome data with a third party [145, 146]. Our next efforts will focus on exploring privacy-preserving real-time pre-alignment filtering.

3. Since a single-core SneakySnake/GateKeeper/Shouji has only a small footprint on the FPGA, we can combine our architecture with any of the FPGA-based accelerators for BWT-FM or hash-based mapping techniques on a single FPGA chip. With such a combination, the end result would be an efficient and fast multi-layer mapping system: alignments that pass our pre-alignment filter can be further verified using a dynamic-programming-based alignment algorithm within the same chip.

4. To further improve the performance of our SneakySnake pre-alignment filter, we can take full advantage of the redundancy across both the reference genome and the reads present in large sequencing data sets, and store the optimal path for small subsequences of the read and the reference sequences. We can later use the pre-computed paths to calculate the approximate edit distance quickly. This requires understanding the trade-offs between the memory footprint of seed filtering and the speed of pre-alignment filtering.
Another approach for accelerating our SneakySnake algorithm is to exploit GPUs or multi-threaded processors to achieve a highly parallel implementation.

5. We also aim to explore the possibility of performing the full alignment step in linear time complexity in terms of read length. We believe that we can achieve this challenging goal by improving the accuracy of the DFS algorithm in SneakySnake. If we penalize each potential path based on its distance from the main diagonal, we can eventually infer the exact number of edits along the entire path. This allows us not only to calculate the edit distance accurately, but also to provide the optimal alignment on the fly, without performing "backtracking", by printing the snake's path. The path contains all the needed information, such as the location, the number, and the type of the edits involved.

Chapter 11

Other Works of This Author

During the course of my doctoral studies, I had the opportunity to work closely and collaborate with my fellow graduate students and researchers from many other institutions. These projects helped me to 1) learn more about genome analysis, metagenomics, and bioinformatics in general, 2) develop my own ideas and skills to conduct successful research, and 3) widen my network of good mentors and researchers. In this chapter, I would like to acknowledge my collaborators and the projects related to this dissertation.

In a collaboration with Eleazar Eskin, Serghei Mangul, David Koslicki, and Nathan LaPierre, we worked on developing new methods for accurate and comprehensive microbial community profiling. I also worked on developing an interactive visualization tool for facilitating the analysis of the profiling results, which leaves the exploration and interpretation to the user. This gave me the opportunity to extend my knowledge about metagenomics and gain more insight into the need for accelerating metagenomic profiling tools.
This work is called MiCoP and is published in BMC Genomics [40] and in bioRxiv [147]. I am very happy to have started this collaboration and to continue working together on other related projects. During the last year of my doctoral studies, we started another two new projects: 1) improving the speed and accuracy of our metagenomics profiler using a new seed filtering algorithm and a new read mapper, and 2) providing a survey of algorithmic foundations and methodologies across read mapping methods for both short and long reads. We discuss how longer read lengths produce unique advantages and limitations for read mapping techniques. We also discuss how general mapping algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies. This work is the first of its kind to provide a clear roadmap of how technology dictates read mapping algorithms.

In a collaboration with Onur Mutlu, Jeremie Kim, and Damla Senol Cali, we developed GRIM-Filter, a pre-alignment filter for reducing the number of invalid mappings in genome analysis. It exploits the high memory bandwidth and the logic layer of 3D-stacked memory to perform highly-parallel filtering inside the 3D-stacked memory itself. This gave me the opportunity to learn more about the 3D-stacked memory architecture, which has a lot in common with other memory architectures. This work is published in BMC Genomics [25] and available on arXiv [109].

In a collaboration with Erman Ayday, Nour Almadhoun, and Azita Nouri, we surveyed a wide spectrum of cross-layer privacy breaching strategies against human genomic data (using both public genomic databases and other public non-genomic data). We outline the principles and outcomes of each technique, and assess its technological complexity and maturation. We then review potential privacy-preserving countermeasure mechanisms for each threat.
This work introduced me to the importance of developing privacy-preserving tools for genome analysis applications. This work is presented at DPM [146] and PRIVAGEN [148].

All of these works are closely related to this dissertation. During these collaborations, we all gained valuable expertise on improving genome analysis by learning from each other.

Bibliography

[1] J. M. Lane, J. Liang, I. Vlasac, S. G. Anderson, D. A. Bechtold, J. Bowden, R. Emsley, S. Gill, M. A. Little, A. I. Luik, et al., "Genome-wide association analyses of sleep disturbance traits identify new loci and highlight shared genetics with neuropsychiatric and metabolic traits," Nature Genetics, vol. 49, no. 2, p. 274, 2017.

[2] G. Dobigny, J. Britton-Davidian, and T. J. Robinson, "Chromosomal polymorphism in mammals: an evolutionary perspective," Biological Reviews, vol. 92, no. 1, pp. 1-21, 2017.

[3] P. H. Sudmant, T. Rausch, E. J. Gardner, R. E. Handsaker, A. Abyzov, J. Huddleston, Y. Zhang, K. Ye, G. Jun, M. H.-Y. Fritz, et al., "An integrated map of structural variation in 2,504 human genomes," Nature, vol. 526, no. 7571, p. 75, 2015.

[4] A. Korte and A. Farlow, "The advantages and limitations of trait analysis with GWAS: a review," Plant Methods, vol. 9, no. 1, p. 29, 2013.

[5] P. M. Visscher, M. A. Brown, M. I. McCarthy, and J. Yang, "Five years of GWAS discovery," The American Journal of Human Genetics, vol. 90, no. 1, pp. 7-24, 2012.

[6] F. Antonacci, J. M. Kidd, T. Marques-Bonet, M. Ventura, P. Siswara, Z. Jiang, and E. E. Eichler, "Characterization of six human disease-associated inversion polymorphisms," Human Molecular Genetics, vol. 18, no. 14, pp. 2555-2566, 2009.

[7] P. K. Han, K. L. Umstead, B. A. Bernhardt, R. C. Green, S. Joffe, B. Koenig, I. Krantz, L. B. Waterston, L. G. Biesecker, and B. B. Biesecker, "A taxonomy of medical uncertainties in clinical genome sequencing," Genetics in Medicine, vol. 19, no. 8, p. 918, 2017.

[8] P. M. Visscher, N. R. Wray, Q. Zhang, P. Sklar, M. I. McCarthy, M. A. Brown, and J. Yang, "10 years of GWAS discovery: biology, function, and translation," The American Journal of Human Genetics, vol. 101, no. 1, pp. 5-22, 2017.

[9] L. Chin, J. N. Andersen, and P. A. Futreal, "Cancer genomics: from discovery science to personalized medicine," Nature Medicine, vol. 17, no. 3, p. 297, 2011.

[10] L. Ding, M. C. Wendl, D. C. Koboldt, and E. R. Mardis, "Analysis of next-generation genomic data in cancer: accomplishments and challenges," Human Molecular Genetics, vol. 19, no. R2, pp. R188-R196, 2010.

[11] D. E. Pritchard, F. Moeckel, M. S. Villa, L. T. Housman, C. A. McCarty, and H. L. McLeod, "Strategies for integrating personalized medicine into healthcare practice," Personalized Medicine, vol. 14, no. 2, pp. 141-152, 2017.

[12] E. D. Esplin, L. Oei, and M. P. Snyder, "Personalized sequencing and the future of medicine: discovery, diagnosis and defeat of disease," Pharmacogenomics, vol. 15, no. 14, pp. 1771-1790, 2014.

[13] M. L. Metzker, "Sequencing technologies—the next generation," Nature Reviews Genetics, vol. 11, no. 1, p. 31, 2010.

[14] M. A. Hamburg and F. S. Collins, "The path to personalized medicine," New England Journal of Medicine, vol. 363, no. 4, pp. 301-304, 2010.

[15] G. S. Ginsburg and J. J. McCarthy, "Personalized medicine: revolutionizing drug discovery and patient care," TRENDS in Biotechnology, vol. 19, no. 12, pp. 491-496, 2001.

[16] J. A. Reuter, D. V. Spacek, and M. P. Snyder, "High-throughput sequencing technologies," Molecular Cell, vol. 58, no. 4, pp. 586-597, 2015.

[17] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, vol. 147, pp. 195-197, 1981.

[18] H. Cheng, H. Jiang, J. Yang, Y. Xu, and Y. Shang, "BitMapper: an efficient all-mapper based on bit-vector computing," BMC Bioinformatics, vol. 16, no. 1, p. 192, 2015.

[19] H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, and C. Alkan, "Accelerating read mapping with FastHASH," BMC Genomics, vol. 14, no. 1, p. S13, 2013.

[20] F. Hach, F. Hormozdiari, C. Alkan, F. Hormozdiari, I. Birol, E. E. Eichler, and S. C. Sahinalp, "mrsFAST: a cache-oblivious algorithm for short-read mapping," Nature Methods, vol. 7, no. 8, p. 576, 2010.

[21] H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM," arXiv preprint arXiv:1303.3997, 2013.

[22] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, p. 357, 2012.

[23] R. Luo, T. Wong, J. Zhu, C.-M. Liu, X. Zhu, E. Wu, L.-K. Lee, H. Lin, W. Zhu, D. W. Cheung, et al., "SOAP3-dp: fast, accurate and sensitive GPU-based short read aligner," PLoS ONE, vol. 8, no. 5, p. e65632, 2013.

[24] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys (CSUR), vol. 33, no. 1, pp. 31-88, 2001.

[25] J. S. Kim, D. S. Cali, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, and O. Mutlu, "GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies," BMC Genomics, vol. 19, no. 2, p. 89, 2018.

[26] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, p. 1185, 2012.

[27] C. Firtina, J. S. Kim, M. Alser, D. S. Cali, A. E. Cicek, C. Alkan, and O. Mutlu, "Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm," arXiv preprint arXiv:1902.04341, 2019.

[28] E. J. Fox, K. S. Reid-Bayliss, M. J. Emond, and L. A. Loeb, "Accuracy of next generation sequencing platforms," Next Generation, Sequencing & Applications, vol. 1, 2014.

[29] K. J. McKernan, H. E. Peckham, G. L. Costa, S. F. McLaughlin, Y. Fu, E. F. Tsung, C. R. Clouser, C. Duncan, J. K. Ichikawa, C. C. Lee, et al., "Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding," Genome Research, vol. 19, no. 9, pp.
1527–1541, 2009.[30] C. Calude, K. Salomaa, and S. Yu, “Additive distances and quasi-distancesbetween words,” Journal of Universal Computer Science , vol. 8, no. 2,pp. 141–152, 2002.[31] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions,and reversals,” in Soviet physics doklady , vol. 10, pp. 707–710, 1966.[32] S. B. Needleman and C. D. Wunsch, “A general method applicable to thesearch for similarities in the amino acid sequence of two proteins,” Journalof molecular biology , vol. 48, no. 3, pp. 443–453, 1970.[33] S. Canzar and S. L. Salzberg, “Short read mapping: an algorithmic tour,” Proceedings of the IEEE , vol. 105, no. 3, pp. 436–458, 2017.[34] M. Escalona, S. Rocha, and D. Posada, “A comparison of tools for thesimulation of genomic next-generation sequencing data,” Nature ReviewsGenetics , vol. 17, no. 8, p. 459, 2016.[35] M. A. Eberle, E. Fritzilas, P. Krusche, M. K¨allberg, B. L. Moore, M. A.Bekritsky, Z. Iqbal, H.-Y. Chuang, S. J. Humphray, A. L. Halpern, et al. ,“A reference data set of 5.4 million phased human variants validated by ge-netic inheritance from sequencing a three-generation 17-member pedigree,” Genome research , vol. 27, no. 1, pp. 157–164, 2017.12736] G. X. Zheng, B. T. Lau, M. Schnall-Levin, M. Jarosz, J. M. Bell, C. M.Hindson, S. Kyriazopoulou-Panagiotopoulou, D. A. Masquelier, L. Merrill,J. M. Terry, et al. , “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing,” Nature biotechnology , vol. 34, no. 3,p. 303, 2016.[37] M. Griffith, C. A. Miller, O. L. Griffith, K. Krysiak, Z. L. Skidmore,A. Ramu, J. R. Walker, H. X. Dang, L. Trani, D. E. Larson, et al. , “Opti-mizing cancer genome sequencing and analysis,” Cell systems , vol. 1, no. 3,pp. 210–223, 2015.[38] I. Iossifov, B. J. O’roak, S. J. Sanders, M. Ronemus, N. Krumm, D. Levy,H. A. Stessman, K. T. Witherspoon, L. Vives, K. E. Patterson, et al. 
, “Thecontribution of de novo coding mutations to autism spectrum disorder,” Nature , vol. 515, no. 7526, p. 216, 2014.[39] M. Meyerson, S. Gabriel, and G. Getz, “Advances in understanding cancergenomes through second-generation sequencing,” Nature Reviews Genetics ,vol. 11, no. 10, p. 685, 2010.[40] N. LaPierre, S. Mangul, M. Alser, I. Mandric, N. C. Wu, D. Koslicki, andE. Eskin, “MiCoP: microbial community profiling method for detecting vi-ral and fungal organisms in metagenomic samples,” BMC genomics , vol. 20,no. 5, p. 423, 2019.[41] J. Qin, Y. Li, Z. Cai, S. Li, J. Zhu, F. Zhang, S. Liang, W. Zhang, Y. Guan,D. Shen, et al. , “A metagenome-wide association study of gut microbiotain type 2 diabetes,” Nature , vol. 490, no. 7418, p. 55, 2012.[42] N. Segata, L. Waldron, A. Ballarini, V. Narasimhan, O. Jousson, andC. Huttenhower, “Metagenomic microbial community profiling using uniqueclade-specific marker genes,” Nature methods , vol. 9, no. 8, p. 811, 2012.[43] E. Afshinnekoo, C. Meydan, S. Chowdhury, D. Jaroudi, C. Boyer, N. Bern-stein, J. M. Maritz, D. Reeves, J. Gandara, S. Chhangawala, et al. ,“Geospatial resolution of human and bacterial diversity with city-scalemetagenomics,” Cell systems , vol. 1, no. 1, pp. 72–87, 2015.12844] S. K. Delaney, M. L. Hultner, H. J. Jacob, D. H. Ledbetter, J. J. McCarthy,M. Ball, K. B. Beckman, J. W. Belmont, C. S. Bloss, M. F. Christman, et al. , “Toward clinical genomics in everyday medicine: perspectives andrecommendations,” Expert review of molecular diagnostics , vol. 16, no. 5,pp. 521–532, 2016.[45] L. P´erez-Lago, M. Mart´ınez-Lirola, S. Garc´ıa, M. Herranz, I. Mokrousov,I. Comas, L. Mart´ınez-Priego, E. Bouza, and D. Garc´ıa-de Viedma, “UrgentImplementation in a Hospital Setting of a Strategy To Rule Out SecondaryCases Caused by Imported Extensively Drug-Resistant Mycobacterium tu-berculosis Strains at Diagnosis,” Journal of clinical microbiology , vol. 54,no. 12, pp. 2969–2974, 2016.[46] S. F. Kingsmore, J. 
Petrikin, L. K. Willig, and E. Guest, “Emergency med-ical genomes: a breakthrough application of precision medicine,” Genomemedicine , vol. 7, no. 1, p. 82, 2015.[47] L. K. Willig, J. E. Petrikin, L. D. Smith, C. J. Saunders, I. Thiffault, N. A.Miller, S. E. Soden, J. A. Cakici, S. M. Herd, G. Twist, et al. , “Whole-genome sequencing for identification of Mendelian disorders in critically illinfants: a retrospective analysis of diagnostic and clinical findings,” TheLancet Respiratory Medicine , vol. 3, no. 5, pp. 377–387, 2015.[48] N. M. Sweeney, S. A. Nahas, S. Chowdhury, M. Del Campo, M. C. Jones,D. P. Dimmock, S. F. Kingsmore, R. Investigators, et al. , “The case forearly use of rapid whole genome sequencing in management of criticallyill infants: Late diagnosis of Coffin-Siris syndrome in an infant with leftcongenital diaphragmatic hernia, congenital heart disease and recurrentinfections,” Molecular Case Studies , pp. mcs–a002469, 2018.[49] L. Farnaes, A. Hildreth, N. M. Sweeney, M. M. Clark, S. Chowdhury, S. Na-has, J. A. Cakici, W. Benson, R. H. Kaplan, R. Kronick, et al. , “Rapidwhole-genome sequencing decreases infant morbidity and cost of hospital-ization,” NPJ genomic medicine , vol. 3, no. 1, p. 10, 2018.[50] J. S. Berg, L. M. Amendola, C. Eng, E. Van Allen, S. W. Gray, N. Wagle,H. L. Rehm, E. T. DeChene, M. C. Dulik, F. M. Hisama, et al. , “Processes129nd preliminary outputs for identification of actionable genes as incidentalfindings in genomic sequence data in the Clinical Sequencing ExploratoryResearch Consortium,” Genetics in Medicine , vol. 15, no. 11, p. 860, 2013.[51] C. R. Ferreira, D. S. Regier, D. W. Hadley, P. S. Hart, and M. Muenke,“Medical genetics and genomic medicine in the United States of America.Part 1: history, demographics, legislation, and burden of disease,” Molecu-lar genetics & genomic medicine , vol. 5, no. 4, pp. 307–316, 2017.[52] S. S. Banerjee, M. El-Hadedy, J. B. Lim, Z. T. Kalbarczyk, D. Chen,S. Lumetta, and R. K. 
Iyer, “ASAP: Accelerated Short-Read Alignmenton Programmable Hardware,” arXiv preprint arXiv:1803.02657 , 2018.[53] X. Fei, Z. Dan, L. Lina, M. Xin, and Z. Chunlei, “FPGASW: Acceler-ating Large-Scale Smith–Waterman Sequence Alignment Application withBacktracking on FPGA Linear Systolic Array,” Interdisciplinary Sciences:Computational Life Sciences , vol. 10, no. 1, pp. 176–188, 2018.[54] J. Daily, “Parasail: SIMD C library for global, semi-global, and local pair-wise sequence alignments,” BMC bioinformatics , vol. 17, no. 1, p. 81, 2016.[55] E. J. Houtgast, V.-M. Sima, K. Bertels, and Z. Al-Ars, “An FPGA-basedsystolic array to accelerate the BWA-MEM genomic mapping algorithm,”in Embedded Computer Systems: Architectures, Modeling, and Simulation(SAMOS), 2015 International Conference on , pp. 221–227, IEEE, 2015.[56] E. Georganas, A. Bulu¸c, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick,“meraligner: A fully parallel sequence aligner,” in Parallel and DistributedProcessing Symposium (IPDPS), 2015 IEEE International , pp. 561–570,IEEE, 2015.[57] Y. Liu and B. Schmidt, “GSWABE: faster GPU-accelerated sequence align-ment with optimal alignment retrieval for short DNA sequences,” Concur-rency and Computation: Practice and Experience , vol. 27, no. 4, pp. 958–972, 2015. 13058] H. M. Waidyasooriya, M. Hariyama, and M. Kameyama, “FPGA-accelerator for DNA sequence alignment based on an efficient data-dependent memory access scheme,” Highly-Efficient Accelerators and Re-configurable Technologies , pp. 127–130, 2014.[59] Y. Liu, A. Wirawan, and B. Schmidt, “CUDASW++ 3.0: acceleratingSmith-Waterman protein database search by coupling CPU and GPU SIMDinstructions,” BMC bioinformatics , vol. 14, no. 1, p. 117, 2013.[60] J. Arram, K. H. Tsoi, W. Luk, and P. Jiang, “Hardware acceleration ofgenetic sequence alignment,” in International Symposium on Applied Re-configurable Computing , pp. 13–24, Springer, 2013.[61] D. Weese, M. Holtgrewe, and K. 
Reinert, “RazerS 3: faster, fully sensitiveread mapping,” Bioinformatics , vol. 28, no. 20, pp. 2592–2599, 2012.[62] A. Ahmadi, A. Behm, N. Honnalli, C. Li, L. Weng, and X. Xie, “Hobbes:optimized gram-based methods for efficient read alignment,” Nucleic acidsresearch , vol. 40, no. 6, pp. e41–e41, 2011.[63] G. Rizk and D. Lavenier, “GASSST: global alignment short sequence searchtool,” Bioinformatics , vol. 26, no. 20, pp. 2534–2540, 2010.[64] D. Weese, A.-K. Emde, T. Rausch, A. D¨oring, and K. Reinert, “Raz-erS—fast read mapping with sensitivity control,” Genome research , vol. 19,no. 9, pp. 1646–1654, 2009.[65] H. Xin, J. Greth, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, andO. Mutlu, “Shifted Hamming distance: a fast and accurate SIMD-friendlyfilter to accelerate alignment verification in read mapping,” Bioinformatics ,vol. 31, no. 10, pp. 1553–1560, 2015.[66] T. Nishimura, J. L. Bordim, Y. Ito, and K. Nakano, “Accelerating theSmith-Waterman Algorithm Using Bitwise Parallel Bulk ComputationTechnique on GPU,” in Parallel and Distributed Processing SymposiumWorkshops (IPDPSW), 2017 IEEE International , pp. 932–941, IEEE, 2017.13167] P. Chen, C. Wang, X. Li, and X. Zhou, “Accelerating the next generationlong read mapping with the FPGA-based system,” IEEE/ACM transac-tions on computational biology and bioinformatics , vol. 11, no. 5, pp. 840–852, 2014.[68] H. M. Waidyasooriya and M. Hariyama, “Hardware-acceleration of short-read alignment based on the Burrows-Wheeler transform,” IEEE Transac-tions on Parallel and Distributed Systems , vol. 27, no. 5, pp. 1358–1372,2016.[69] R. C. Edgar and K. Sj¨olander, “A comparison of scoring functions for pro-tein sequence profile alignment,” Bioinformatics , vol. 20, no. 8, pp. 1301–1308, 2004.[70] H. Xin, S. Nahar, R. Zhu, J. Emmons, G. Pekhimenko, C. Kingsford,C. Alkan, and O. Mutlu, “Optimal seed solver: optimizing seed selection inread mapping,” Bioinformatics , vol. 32, no. 11, pp. 1632–1642, 2015.[71] M. Alser, H. 
Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, “Gate-Keeper: a new hardware architecture for accelerating pre-alignment in DNAshort read mapping,” Bioinformatics , vol. 33, no. 21, pp. 3355–3363, 2017.[72] M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, “Gate-Keeper: A New Hardware Architecture for Accelerating Pre-Alignment inDNA Short Read Mapping,” arXiv preprint arXiv:1604.01789 , 2016.[73] M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, “Shouji: a fastand efficient pre-alignment filter for sequence alignment,” Bioinformatics ,2019.[74] M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, “SLIDER: Fastand Efficient Computation of Banded Sequence Alignment,” arXiv preprintarXiv:1809.07858 , 2018.[75] M. Alser, O. Mutlu, and C. Alkan, “MAGNET: Understanding and im-proving the accuracy of genome pre-alignment filtering,” Transactions onInternet Research , vol. 13, no. 2, pp. 33–42, 2017.13276] M. Alser, O. Mutlu, and C. Alkan, “MAGNET: Understanding and im-proving the accuracy of genome pre-alignment filtering,” arXiv preprintarXiv:1707.01631 , 2017.[77] M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, “Explor-ing Speed/Accuracy Trade-offs in Hardware Accelerated Pre-Alignment inGenome Analysis,” in HPCA2018 Workshop on Accelerator Architecture inComputational Biology and Bioinformatics (AACBB), , Vienna, Austria,February 24th, 2018. https://aacbb-workshop.github.io/ , 2018.[78] M. ˇSoˇsi´c and M. ˇSiki´c, “Edlib: a C/C++ library for fast, exact sequencealignment using edit distance,” Bioinformatics , vol. 33, no. 9, pp. 1394–1395, 2017.[79] P. Weiner, “Linear pattern matching algorithms,” in Switching and Au-tomata Theory, 1973. SWAT’08. IEEE Conference Record of 14th AnnualSymposium on , pp. 1–11, IEEE, 1973.[80] M. T. ¨Ozsu and P. Valduriez, Principles of distributed database systems .Springer Science & Business Media, 2011.[81] M. 
BURROWS, “A block-sorting lossless data compression algorithm,” SRC Research Report, 124 , 1994.[82] P. Ferragina and G. Manzini, “Opportunistic data structures with applica-tions,” in Foundations of Computer Science, 2000. Proceedings. 41st AnnualSymposium on , pp. 390–398, IEEE, 2000.[83] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, andM. Brudno, “SHRiMP: accurate mapping of short color-space reads,” PLoScomputational biology , vol. 5, no. 5, p. e1000386, 2009.[84] H. Li and R. Durbin, “Fast and accurate short read alignment withBurrows–Wheeler transform,” Bioinformatics , vol. 25, no. 14, pp. 1754–1760, 2009.[85] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows–Wheeler transform,” Bioinformatics , vol. 26, no. 5, pp. 589–595, 2010.13386] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast andmemory-efficient alignment of short DNA sequences to the human genome,” Genome biology , vol. 10, no. 3, p. R25, 2009.[87] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang,“SOAP2: an improved ultrafast tool for short read alignment,” Bioinfor-matics , vol. 25, no. 15, pp. 1966–1967, 2009.[88] C.-M. Liu, T. Wong, E. Wu, R. Luo, S.-M. Yiu, Y. Li, B. Wang, C. Yu,X. Chu, K. Zhao, et al. , “SOAP3: ultra-fast GPU-based parallel alignmenttool for short reads,” Bioinformatics , vol. 28, no. 6, pp. 878–879, 2012.[89] C. Firtina and C. Alkan, “On genomic repeats and reproducibility,” Bioin-formatics , vol. 32, no. 15, pp. 2243–2247, 2016.[90] I. Medina, J. T´arraga, H. Mart´ınez, S. Barrachina, M. Castillo, J. Paschall,J. Salavert-Torres, I. Blanquer-Espert, V. Hern´andez-Garc´ıa, E. Quintana-Ort´ı, et al. , “Highly sensitive and ultrafast read mapping for RNA-seq anal-ysis,” DNA Research , vol. 23, no. 2, pp. 93–100, 2016.[91] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotidealignment program,” Bioinformatics , vol. 24, no. 5, pp. 713–714, 2008.[92] H. Li, J. Ruan, and R. 
Durbin, “Mapping short DNA sequencing reads andcalling variants using mapping quality scores,” Genome research , vol. 18,no. 11, pp. 1851–1858, 2008.[93] A. D. Smith, Z. Xuan, and M. Q. Zhang, “Using quality scores and longerreads improves accuracy of Solexa read mapping,” BMC bioinformatics ,vol. 9, no. 1, p. 128, 2008.[94] H. Lin, Z. Zhang, M. Q. Zhang, B. Ma, and M. Li, “ZOOM! Zillions ofoligos mapped,” Bioinformatics , vol. 24, no. 21, pp. 2431–2437, 2008.[95] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basiclocal alignment search tool,” Journal of molecular biology , vol. 215, no. 3,pp. 403–410, 1990. 13496] N. Homer, B. Merriman, and S. F. Nelson, “BFAST: an alignment tool forlarge scale genome resequencing,” PloS one , vol. 4, no. 11, p. e7767, 2009.[97] M. David, M. Dzamba, D. Lister, L. Ilie, and M. Brudno, “SHRiMP2:sensitive yet practical short read mapping,” Bioinformatics , vol. 27, no. 7,pp. 1011–1012, 2011.[98] F. Hormozdiari, F. Hach, S. C. Sahinalp, E. E. Eichler, and C. Alkan,“Sensitive and fast mapping of di-base encoded reads,” Bioinformatics ,vol. 27, no. 14, pp. 1915–1921, 2011.[99] W.-P. Lee, M. P. Stromberg, A. Ward, C. Stewart, E. P. Garrison, and G. T.Marth, “MOSAIK: a hash-based algorithm for accurate next-generationsequencing short-read mapping,” PloS one , vol. 9, no. 3, p. e90581, 2014.[100] J. Blom, T. Jakobi, D. Doppmeier, S. Jaenicke, J. Kalinowski, J. Stoye,and A. Goesmann, “Exact and complete short-read alignment to microbialgenomes using Graphics Processing Unit programming,” Bioinformatics ,vol. 27, no. 10, pp. 1351–1358, 2011.[101] E. Ukkonen, “Algorithms for approximate string matching,” Informationand control , vol. 64, no. 1-3, pp. 100–118, 1985.[102] D. Yorukoglu, Y. W. Yu, J. Peng, and B. Berger, “Compressive mapping fornext-generation sequencing,” Nature biotechnology , vol. 34, no. 4, p. 374,2016.[103] J. Arram, K. H. Tsoi, W. Luk, and P. 
Jiang, “Reconfigurable acceleration ofshort read mapping,” in Field-Programmable Custom Computing Machines(FCCM), 2013 IEEE 21st Annual International Symposium on , pp. 210–217, IEEE, 2013.[104] C. B. Olson, M. Kim, C. Clauson, B. Kogon, C. Ebeling, S. Hauck, andW. L. Ruzzo, “Hardware acceleration of short read mapping,” in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th An-nual International Symposium on , pp. 161–168, IEEE, 2012.135105] Y. Sogabe and T. Maruyama, “FPGA acceleration of short read mappingbased on sort and parallel comparison,” in Field Programmable Logic andApplications (FPL), 2014 24th International Conference on , pp. 1–4, IEEE,2014.[106] S. M. Kie(cid:32)lbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith, “Adaptiveseeds tame genomic sequence comparison,” Genome research , vol. 21, no. 3,pp. 487–493, 2011.[107] L. Egidi and G. Manzini, “Better spaced seeds using quadratic residues,” Journal of Computer and System Sciences , vol. 79, no. 7, pp. 1144–1155,2013.[108] R. Lederman, “A random-permutations-based approach to fast read align-ment,” in BMC bioinformatics , vol. 14, p. S8, BioMed Central, 2013.[109] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin,C. Alkan, and O. Mutlu, “GRIM-Filter: fast seed filtering in read mappingusing emerging memory technologies,” arXiv preprint arXiv:1708.04329 ,2017.[110] J. Kim, D. Senol, H. Xin, D. Lee, M. Alser, H. Hassan, O. Ergin, C. Alkan,and O. Mutlu, “Genome Read In-Memory (GRIM) Filter: Fast LocationFiltering in DNA Read Mapping using Emerging Memory Technologies,”2017.[111] W. J. Masek and M. S. Paterson, “A faster algorithm computing stringedit distances,” Journal of Computer and System Sciences , vol. 20, no. 1,pp. 18–31, 1980.[112] A. Backurs and P. Indyk, “Edit Distance Cannot Be Computed inStrongly Subquadratic Time (unless SETH is false),” arXiv preprintarXiv:1412.0348v4 , 2017.[113] A. Al Kawam, S. Khatri, and A. 
Datta, “A Survey of Software and Hard-ware Approaches to Performing Read Alignment in Next Generation Se-quencing,” IEEE/ACM Transactions on Computational Biology and Bioin-formatics (TCBB) , vol. 14, no. 6, pp. 1202–1213, 2017.136114] S. Aluru and N. Jammula, “A review of hardware acceleration for compu-tational genomics,” IEEE Design & Test , vol. 31, no. 1, pp. 19–30, 2014.[115] H.-C. Ng, S. Liu, and W. Luk, “Reconfigurable acceleration of geneticsequence alignment: A survey of two decades of efforts,” in Field Pro-grammable Logic and Applications (FPL), 2017 27th International Confer-ence on , pp. 1–8, IEEE, 2017.[116] E. F. D. O. Sandes, A. Boukerche, and A. C. M. A. D. Melo, “Paralleloptimal pairwise biological sequence comparison: Algorithms, platforms,and classification,” ACM Computing Surveys (CSUR) , vol. 48, no. 4, p. 63,2016.[117] H. Kung, “Why Systolic Architectures?,” Computer , vol. 15, no. 1, pp. 37–46, 1982.[118] Y.-T. Chen, J. Cong, Z. Fang, J. Lei, and P. Wei, “When apache sparkmeets FPGAs: a case study for next-generation DNA sequencing accel-eration,” in Proceedings of the 8th USENIX Conference on Hot Topics inCloud Computing , pp. 64–70, USENIX Association, 2016.[119] Y.-T. Chen and J. Cong, “CS-BWAMEM: A fast and scalable read alignerat the cloud scale for whole genome sequencing,” in Proceedings of HighThroughput Sequencing Algorithms and Applications (HITSEQ) , 2015.[120] C. Wang, R.-X. Yan, X.-F. Wang, J.-N. Si, and Z. Zhang, “Comparisonof linear gap penalties and profile-based variable gap penalties in profile–profile alignments,” Computational biology and chemistry , vol. 35, no. 5,pp. 308–318, 2011.[121] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices fromprotein blocks,” Proceedings of the National Academy of Sciences , vol. 89,no. 22, pp. 10915–10919, 1992.[122] A. Hatem, D. Bozda˘g, A. E. Toland, and ¨U. V. C¸ ataly¨urek, “Benchmarkingshort sequence mapping tools,” BMC bioinformatics , vol. 14, no. 1, p. 184,2013. 
[123] M. C. Herbordt, T. VanCourt, Y. Gu, B. Sukhwani, A. Conti, J. Model, and D. DiSabello, "Achieving high performance with FPGA-based computing," Computer, vol. 40, no. 3, 2007.
[124] S. M. Trimberger, "Three ages of FPGAs: A retrospective on the first thirty years of FPGA technology," Proceedings of the IEEE, vol. 103, no. 3, pp. 318–331, 2015.
[125] N. McVicar, A. Hoshino, A. La Torre, T. A. Reh, W. L. Ruzzo, and S. Hauck, "FPGA Acceleration of Short Read Alignment," arXiv preprint arXiv:1805.00106, 2018.
[126] N. Alachiotis, D. Theodoropoulos, and D. Pnevmatikatos, "Versatile deployment of FPGA accelerators in disaggregated data centers: A bioinformatics case study," in Field Programmable Logic and Applications (FPL), 2017 27th International Conference on, pp. 1–4, IEEE, 2017.
[127] M. Assaad and M. Alser, "An FPGA-based design and implementation of an all-digital serializer for inter module communication in SoC," IEICE Electronics Express, vol. 8, no. 23, pp. 2017–2023, 2011.
[128] M. H. Alser, Accelerating the understanding of life's code through better algorithms and hardware design. PhD thesis, Bilkent University, 2018.
[129] M. H. Alser, M. M. Assaad, and F. A. Hussin, "A wide-range programmable frequency synthesizer based on a finite state machine filter," International Journal of Electronics, vol. 100, no. 11, pp. 1546–1556, 2013.
[130] M. Assaad and M. H. Alser, "Design of an all-digital synchronized frequency multiplier based on a dual-loop (D/FLL) architecture," VLSI Design, vol. 2012, p. 17, 2012.
[131] M. H. Alser, M. Assaad, F. A. Hussin, and I. Yohannes, "Design and FPGA implementation of PLL-based quarter-rate clock and data recovery circuit," vol. 2, pp. 825–830, IEEE, 2012.
[132] M. H. Alser and M. M. Assaad, "Design and modeling of low-power clockless serial link for data communication systems," pp. 1–5, IEEE, 2011.
[133] X. Virtex, "FPGA VC709 connectivity kit," Xilinx: All Programmable, 7.
[134] M. Jacobsen, D. Richmond, M. Hogains, and R. Kastner, "RIFFA 2.1: A reusable integration framework for FPGA accelerators," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 8, no. 4, p. 22, 2015.
[135] U. Guide, "Series FPGAs configurable logic block," Xilinx, San Jose, CA, vol. 1, 7.
[136] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, "Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 273–287, ACM, 2017.
[137] D. J. Lipman and W. R. Pearson, "Rapid and sensitive protein similarity searches," Science, vol. 227, no. 4693, pp. 1435–1441, 1985.
[138] J. W. J. Williams, "Algorithm 232: heapsort," Communications of the ACM, vol. 7, no. 6, pp. 347–348, 1964.
[139] M. McNamara et al., "IEEE Standard Verilog Hardware Description Language," The Institute of Electrical and Electronics Engineers, Inc., IEEE Std 1364-2001, 2001.
[140] R. Tarjan, "Depth-first search and linear graph algorithms," SIAM Journal on Computing, vol. 1, no. 2, pp. 146–160, 1972.
[141] G. Dimitrakopoulos, K. Galanopoulos, C. Mavrokefalidis, and D. Nikolos, "Low-power leading-zero counting and anticipation logic for high-speed floating point units," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 7, pp. 837–850, 2008.
[142] C. Alkan, J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hormozdiari, J. O. Kitzman, C. Baker, M. Malig, O. Mutlu, et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, p. 1061, 2009.
[143] The 1000 Genomes Project Consortium et al., "An integrated map of genetic variation from 1,092 human genomes," Nature, vol. 491, no. 7422, p. 56, 2012.
[144] M. S. Lindner, B. Strauch, J. M. Schulze, S. H. Tausch, P. W. Dabrowski, A. Nitsche, and B. Y. Renard, "HiLive: real-time mapping of Illumina reads while sequencing," Bioinformatics, vol. 33, no. 6, pp. 917–919, 2016.
[145] S. Salinas and P. Li, "Secure Cloud Computing for Pairwise Sequence Alignment," in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 178–183, ACM, 2017.
[146] M. Alser, N. Almadhoun, A. Nouri, C. Alkan, and E. Ayday, "Can you Really Anonymize the Donors of Genomic Data in Today's Digital World?," in Data Privacy Management, and Security Assurance, pp. 237–244, Springer, 2015.
[147] N. LaPierre, S. Mangul, M. Alser, I. Mandric, N. C. Wu, D. Koslicki, and E. Eskin, "MiCoP: Microbial Community Profiling method for detecting viral and fungal organisms in metagenomic samples," bioRxiv, p. 243188, 2018.
[148] M. Alser, N. Almadhoun, A. Nouri, C. Alkan, and E. Ayday, "Identifying Anonymous Donors of Genetic Information," 2015.

Appendix A

Data

In this chapter, we use Edlib [78] (edit distance mode) to assess the number of accepted (i.e., having edits less than or equal to the edit distance threshold) and rejected (i.e., having more edits than the edit distance threshold) pairs for each of the 12 datasets. We also provide the time, in seconds, that Edlib needs to classify all of the pairs. We provide these details for set 1, set 2, set 3, and set 4 in Table A.1. We also provide the same details for set 5, set 6, set 7, and set 8 in Table A.2, and for set 9, set 10, set 11, and set 12 in Table A.3.

Next, we provide the effect of reducing the search space of our SneakySnake algorithm and the best exact edit distance algorithm, Edlib [78]. We use the datasets set 1, set 2, set 3, and set 4 in Figure A.4 and the datasets set 5, set 6, set 7, and set 8 in Figure A.5. Finally, in Figure A.6, we use the datasets set 9, set 10, set 11, and set 12.
Weuse Edlib [78] (edit distance mode) to benchmark the accepted and the rejectedpairs for edit distance thresholds of 0 up to 10 edits. Dataset E Time (sec) Accepted Rejected Dataset E Time (sec) Accepted Rejected0 110.61 381,901 29,618,099 0 120.84 124,531 29,875,469 1 135.30 1,345,842 28,654,158 1 121.67 441,927 29,558,073 2 126.87 3,266,455 26,733,545 2 121.33 1,073,808 28,926,192 3 125.46 5,595,596 24,404,404 3 117.88 2,053,181 27,946,819 4 122.09 7,825,272 22,174,728 4 113.79 3,235,057 26,764,943 5 121.73 9,821,308 20,178,692 5 115.00 4,481,341 25,518,659 6 126.26 11,650,490 18,349,510 6 115.26 5,756,432 24,243,568 7 124.86 13,407,801 16,592,199 7 116.10 7,091,373 22,908,627 8 126.83 15,152,501 14,847,499 8 116.71 8,531,811 21,468,189 9 122.94 16,894,680 13,105,320 9 117.47 10,102,726 19,897,274 10 122.60 18,610,897 11,389,103 10 118.49 11,807,488 18,192,512 Dataset E Time (sec) Accepted Rejected Dataset E Time (sec) Accepted Rejected0 141.01 11,989 29,988,011 0 124.32 11 29,999,989 1 130.52 44,565 29,955,435 1 115.18 18 29,999,982 2 123.52 108,979 29,891,021 2 110.56 24 29,999,976 3 125.81 206,903 29,793,097 3 111.59 27 29,999,973 4 120.18 334,712 29,665,288 4 114.13 29 29,999,971 5 120.08 490,670 29,509,330 5 113.80 34 29,999,966 6 119.60 675,357 29,324,643 6 115.74 83 29,999,917 7 124.48 891,447 29,108,553 7 116.70 177 29,999,823 8 125.64 1,151,447 28,848,553 8 118.12 333 29,999,667 9 123.18 1,469,996 28,530,004 9 120.11 711 29,999,289 10 131.24 1,868,827 28,131,173 10 123.42 1,627 29,998,373 ERR240727_1, mrFAST -e=2, 30 million pairs ERR240727_1, mrFAST -e=3, 30 million pairsERR240727_1, mrFAST -e=5, 30 million pairs ERR240727_1, mrFAST -e=40, 30 million pairs Dataset E Time (sec) Accepted Rejected Dataset E Time (sec) Accepted Rejected0 202.82 1,440,497 28,559,503 0 185.55 248,920 29,751,080 1 170.93 1,868,909 28,131,091 1 187.66 324,056 29,675,944 3 179.84 2,734,841 27,265,159 3 190.54 481,724 29,518,276 4 185.39 3,457,975 26,542,025 4 195.78 612,747 
29,387,253 6 194.29 5,320,713 24,679,287 6 208.64 991,606 29,008,394 7 200.69 6,261,628 23,738,372 7 216.06 1,226,695 28,773,305 9 207.89 7,916,882 22,083,118 9 225.75 1,740,067 28,259,933 10 208.18 8,658,021 21,341,979 10 227.97 2,009,835 27,990,165 12 208.92 10,131,849 19,868,151 12 237.14 2,591,299 27,408,701 13 208.01 10,917,472 19,082,528 13 238.31 2,923,699 27,076,301 15 208.96 12,646,165 17,353,835 15 233.91 3,730,089 26,269,911 Dataset E Time (sec) Accepted Rejected Dataset E Time (sec) Accepted Rejected0 293.88 444 29,999,556 0 244.79 201 29,999,799 1 273.39 695 29,999,305 1 248.65 327 29,999,673 3 269.15 927 29,999,073 3 253.53 444 29,999,556 4 264.98 994 29,999,006 4 258.86 475 29,999,525 6 281.14 1,097 29,998,903 6 280.37 529 29,999,471 7 282.71 1,136 29,998,864 7 286.30 546 29,999,454 9 287.50 1,221 29,998,779 9 292.03 587 29,999,413 10 285.02 1,274 29,998,726 10 293.44 612 29,999,388 12 290.85 1,701 29,998,299 12 301.29 710 29,999,290 13 291.93 2,146 29,997,854 13 302.68 796 29,999,204 15 287.24 3,921 29,996,079 15 299.55 1,153 29,998,847 SRR826460_1, mrFAST -e=4, 30 million pairs SRR826460_1, mrFAST -e=6, 30 million pairsSRR826460_1, mrFAST -e=10, 30 million pairs SRR826460_1, mrFAST -e=70, 30 million pairs Table A.3: Details of our last four datasets (set 9, set 10, set 11, and set 12). Weuse Edlib [78] (edit distance mode) to benchmark the accepted and the rejectedpairs for edit distance thresholds of 0 up to 25 edits. 
SRR826471_1, mrFAST -e=8, 30 million pairs
  E   Time (sec)     Accepted      Rejected
  0      337.06        707,517    29,292,483
  2      319.08      1,462,242    28,537,758
  5      340.47      1,973,835    28,026,165
  7      349.22      2,361,418    27,638,582
 10      345.73      3,183,271    26,816,729
 12      339.16      3,862,776    26,137,224
 15      340.40      4,915,346    25,084,654
 17      340.23      5,550,869    24,449,131
 20      341.09      6,404,832    23,595,168
 22      341.57      6,959,616    23,040,384
 25      342.30      7,857,750    22,142,250

SRR826471_1, mrFAST -e=12, 30 million pairs
  E   Time (sec)     Accepted      Rejected
  0      375.35         43,565    29,956,435
  2      390.71         88,141    29,911,859
  5      449.10        119,100    29,880,900
  7      457.05        145,290    29,854,710
 10      462.65        205,536    29,794,464
 12      464.26        257,360    29,742,640
 15      528.12        346,809    29,653,191
 17      572.97        409,978    29,590,022
 20      521.83        507,177    29,492,823
 22      468.11        572,769    29,427,231
 25      467.72        673,254    29,326,746

SRR826471_1, mrFAST -e=15, 30 million pairs
  E   Time (sec)     Accepted      Rejected
  0      502.22          4,389    29,995,611
  2      469.97          8,970    29,991,030
  5      505.53         12,420    29,987,580
  7      490.44         15,405    29,984,595
 10      488.32         22,014    29,977,986
 12      482.28         27,817    29,972,183
 15      483.46         37,710    29,962,290
 17      483.10         44,225    29,955,775
 20      484.07         54,650    29,945,350
 22      483.88         62,255    29,937,745
 25      482.39         74,761    29,925,239

SRR826471_1, mrFAST -e=100, 30 million pairs
  E   Time (sec)     Accepted      Rejected
  0      449.02             49    29,999,951
  2      467.78            163    29,999,837
  5      524.47            301    29,999,699
  7      533.67            375    29,999,625
 10      612.16            472    29,999,528
 12      655.45            520    29,999,480
 15      657.91            575    29,999,425
 17      547.26            623    29,999,377
 20      542.69            718    29,999,282
 22      541.08            842    29,999,158
 25      544.22          1,133    29,998,867

(Column headers only: E; Edlib-100; Edlib-50; Edlib-25; Edlib-10; Edlib-5; S.Snake-100; S.Snake-5.)

Table A.5: Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using set 5, set 6, set 7, and set 8 datasets.

(Column headers only: E; Edlib-150; Edlib-50; Edlib-25; Edlib-10; Edlib-5; S.Snake-150; S.Snake-5.)
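The accepted/rejected counts reported in the tables above come from thresholding each sequence pair's edit distance at E: a pair is accepted if its edit distance is at most E, and rejected otherwise. The following is a minimal sketch of that classification, using a plain dynamic-programming (Levenshtein) edit distance in place of Edlib's optimized implementation; the function names and the toy pairs are illustrative only.

```python
def edit_distance(a, b):
    """Classic O(len(a) * len(b)) Levenshtein distance via dynamic programming.

    This stands in for Edlib's edit distance mode; it is not Edlib itself.
    """
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # deleting all i characters of a's prefix
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match / substitution
        prev = curr
    return prev[-1]

def benchmark(pairs, threshold):
    """Split pairs into (accepted, rejected) counts at edit distance threshold E."""
    accepted = sum(1 for read, ref in pairs
                   if edit_distance(read, ref) <= threshold)
    return accepted, len(pairs) - accepted

# Toy example with three pairs classified at E = 2
pairs = [("ACGT", "ACGA"), ("ACGT", "ACGT"), ("ACGT", "TTTT")]
accepted, rejected = benchmark(pairs, threshold=2)
# accepted == 2 (distances 1 and 0), rejected == 1 (distance 3)
```

In the actual experiments, each pair is a read and the reference segment at one of its candidate mapping locations, and the tables report how many of the 30 million pairs fall on each side of the threshold.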