Accelerating the Understanding of Life's Code Through Better Algorithms and Hardware Design
This thesis received the 2018 IEEE Turkey Doctoral Dissertation Award. (This copy was updated in October 2019.)
ACCELERATING THE UNDERSTANDING OF LIFE'S CODE THROUGH BETTER ALGORITHMS AND HARDWARE DESIGN

A dissertation submitted to the Graduate School of Engineering and Science of Bilkent University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering
By Mohammed H. K. Alser
June 2018

We certify that we have read this dissertation and that in our opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Can Alkan (Advisor)
Onur Mutlu (Co-Advisor)
Ergin Atalar
Mehmet Koyutürk
Nurcan Tunçbağ
Oğuz Ergin

Approved for the Graduate School of Engineering and Science:

Ezhan Karaşan
Director of the Graduate School
ABSTRACT

ACCELERATING THE UNDERSTANDING OF LIFE'S CODE THROUGH BETTER ALGORITHMS AND HARDWARE DESIGN
Mohammed H. K. Alser
Ph.D. in Computer Engineering
Advisor: Can Alkan
Co-Advisor: Onur Mutlu
June 2018

Our understanding of human genomes today is affected by the ability of modern computing technology to quickly and accurately determine an individual's entire genome. Over the past decade, high throughput sequencing (HTS) technologies have opened the door to remarkable biomedical discoveries through their ability to generate hundreds of millions to billions of DNA segments per run, along with a substantial reduction in time and cost. However, this flood of sequencing data continues to overwhelm the processing capacity of existing algorithms and hardware. To analyze a patient's genome, each of these segments, called reads, must be mapped to a reference genome based on the similarity between the read and "candidate" locations in that reference genome. The similarity measurement, called alignment and formulated as an approximate string matching problem, is the computational bottleneck because: (1) it is implemented using quadratic-time dynamic programming algorithms, and (2) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.

In this thesis, we introduce four new algorithms that function as a pre-alignment step and aim to filter out most incorrect candidate locations. We call our algorithms GateKeeper, Shouji, MAGNET, and SneakySnake.
The first key idea of our proposed pre-alignment filters is to provide high filtering accuracy by correctly detecting all similar segments shared between two sequences. The second key idea is to exploit the massively parallel architecture of modern FPGAs to accelerate our four proposed filtering algorithms. We also develop an efficient CPU implementation of the SneakySnake algorithm for commodity desktops and servers, which are widely available to bioinformaticians without the hassle of handling hardware complexity. We evaluate the benefits and downsides of our pre-alignment filtering approach in detail using 12 real datasets across different read lengths and edit distance thresholds. In our evaluation, we demonstrate that our hardware pre-alignment filters show two to three orders of magnitude speedup over their equivalent CPU implementations. We also demonstrate that integrating our hardware pre-alignment filters with state-of-the-art read aligners reduces the aligner's execution time by up to 21.5x. Finally, we show that an efficient CPU implementation of pre-alignment filtering still provides significant benefits. We show that SneakySnake reduces the execution time of the best performing CPU-based read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. The key conclusion of this thesis is that developing a fast and efficient filtering heuristic, together with a better understanding of its accuracy, leads to a significant reduction in read alignment's execution time without sacrificing any of the aligner's capabilities. We hope and believe that our new architectures and algorithms catalyze their adoption in existing and future genome analysis pipelines.
Keywords: Read Mapping, Approximate String Matching, Read Alignment, Levenshtein Distance, String Algorithms, Edit Distance, fast pre-alignment filter, field-programmable gate arrays (FPGA).
ÖZET (Turkish Abstract)

ACCELERATING THE UNDERSTANDING OF LIFE'S CODE THROUGH BETTER ALGORITHMS AND HARDWARE DESIGNS
Mohammed H. K. Alser
Ph.D. in Computer Engineering
Advisor: Can Alkan
Co-Advisor: Onur Mutlu
June 2018

Our understanding of human genomes today is affected by the ability of modern computing technology to quickly and accurately determine an individual's entire genome. Over the past decade, high throughput sequencing (HTS) technologies have opened the door to remarkable biomedical discoveries through their ability to generate hundreds of millions to billions of DNA segments per run, along with a substantial reduction in time and cost. However, this abundance of sequencing data continues to push the limits of the processing capacity of existing algorithms and hardware. To analyze a patient's genome, each of these segments, called "reads," is placed onto a reference genome based on its similarity to candidate regions in that reference. The similarity computation, formulated as an approximate string matching problem and called alignment, is a computational bottleneck because: (1) it is computed using quadratic-time dynamic programming algorithms, and (2) a large portion of the candidate regions in the reference genome differ so strongly from a given read that they cannot be aligned to it. Computing the alignability of such incorrectly identified candidate regions constitutes the majority of the running time of today's read mapping algorithms. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate regions and remove them from candidacy before running these computationally costly alignment algorithms.

In this thesis, we present four new algorithms that serve as a pre-alignment step and aim to filter out most incorrect candidate locations.
We call our algorithms GateKeeper, Shouji, MAGNET, and SneakySnake. The first key idea of the proposed pre-alignment filters is to provide high filtering accuracy by correctly detecting all similar segments shared between two sequences. The second key idea is to use the massively parallel architecture of modern FPGAs to accelerate our four proposed filtering algorithms. We developed SneakySnake primarily so that bioinformaticians can use it on readily available commodity desktops and servers without having to deal with hardware complexity. We evaluated the advantages and disadvantages of our pre-alignment filtering approach in detail using 12 real datasets with different read lengths and distance thresholds. In our evaluation, we show that our hardware pre-alignment filters are two to three orders of magnitude faster than their equivalent CPU implementations. We also show that integrating our hardware pre-alignment filters with state-of-the-art read aligners reduces the aligner's running time by up to 21.5x, depending on the edit distance threshold. Finally, we show that an efficient CPU implementation of pre-alignment filtering still provides significant benefits. We show that SneakySnake reduces the running times of the best performing CPU-based read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. The main conclusion of this thesis is that developing a fast and efficient filtering mechanism, together with a better understanding of its accuracy, leads to a significant reduction in read alignment's running time without sacrificing any of the aligners' capabilities. We hope and believe that our new architectures and algorithms will catalyze their adoption in existing and future genome analysis pipelines.
Keywords: Read Mapping, Approximate String Matching, Read Alignment, Levenshtein Distance, String Algorithms, Edit Distance, fast pre-alignment filter, field-programmable gate arrays (FPGA).

Acknowledgement
The last nearly four years at Bilkent University have been the most exciting and fruitful time of my life, not only in the academic arena but also on a personal level. For that, I would like to seize this opportunity to acknowledge all the people who have supported and helped me become who I am today. First and foremost, the greatest of all my appreciation goes to the almighty Allah for granting me countless blessings and helping me stay strong and focused to accomplish this work.

I would like to extend my sincere thanks to my advisor Can Alkan, who introduced me to the realm of bioinformatics. I feel very privileged to be part of his lab and I am proud that I am going to be his first PhD graduate. Can generously provided the resources and the opportunities that enabled me to publish and collaborate with researchers from other institutions. I appreciate all his support and guidance at key moments in my studies while allowing me to work independently on my own ideas. I truly enjoyed the memorable time we spent together in Ankara, Izmir, Istanbul, Vienna, and Los Angeles. Thanks to him, I got addicted to dark chocolate with almonds and sea salt.

I am also very thankful to my co-advisor Onur Mutlu for pushing my academic writing to a new level. It is an honor that he also gave me the opportunity to join his lab at ETH Zürich as a postdoctoral researcher. His dedication to research is contagious and inspirational.

I am grateful to the members of my thesis committee: Ergin Atalar, Mehmet Koyutürk, Nurcan Tunçbağ, Oğuz Ergin, and Özcan Öztürk for their valuable comments and discussions.

I would also like to thank Eleazar Eskin and Jason Cong for giving me the opportunity to join UCLA during the summer of 2017 as a staff research associate. I would like to extend my thanks to Serghei Mangul, David Koslicki, Farhad Hormozdiari, and Nathan LaPierre, who provided the guidance to make my visit fruitful and enjoyable.
I am also thankful to Akash Kumar for giving me the chance to join the Chair for Processor Design during the winter of 2017-2018 as a visiting researcher at TU Dresden.

I would like to acknowledge Nordin Zakaria and Izzatdin B A Aziz for providing me the opportunity to continue carrying out PETRONAS projects while I am in Turkey. I am grateful to Nor Hisham Hamid for believing in me, hiring me as a consultant for one of his government-sponsored projects, and inviting me to Malaysia. It was my great fortune to work with them.

I want to thank all my co-authors and colleagues: Eleazar Eskin, Erman Ayday, Hongyi Xin, Hasan Hassan, Serghei Mangul, Oğuz Ergin, Akash Kumar, Nour Almadhoun, Jeremie Kim, and Azita Nouri.

I would like to acknowledge all past and current members of our research group for being both great friends and colleagues to me. Special thanks to Fatma Kahveci for being a great labmate who taught me her native language and tolerated me for many years. Thanks to Gülfem Demir for all her valuable support and encouragement. Thanks to Shatlyk Ashyralyyev for his great help when I prepared for my qualifying examination and for many valuable discussions on research and life. Thanks to Handan Kulan for her friendly nature and priceless support. Thanks to Can Firtina and Damla Şenol Cali (CMU) for all the help in translating the abstract of this thesis to Turkish. I am grateful to other members for their companionship: Marzieh Eslami Rasekh, Azita Nouri, Elif Dal, Zülal Bingöl, Ezgi Ebren, Arda Söylev, Halil İbrahim Özercan, Fatih Karaoğlanoğlu, Tuğba Doğan, and Balanur İçen.

I would like to express my deepest gratitude to Ahmed Nemer Almadhoun, Khaldoun Elbatsh (and his wonderful family), Hamzeh Ahangari, Soe Thane, Mohamed Meselhy Eltoukhy (and his kind family), Abdel-Haleem Abdel-Aty, Heba Kadry, and Ibrahima Faye for all the good times and endless generous support in the ups and downs of life.
I feel very grateful to all my inspiring friends who made Bilkent a happy home for me, especially Tamer Alloh, Basil Heriz, Kadir Akbudak, Amirah Ahmed, Zahra Ghanem, Muaz Draz, Nabeel Abu Baker, Maha Sharei, Salman Dar, Elif Doğan Dar, Ahmed Ouf, Shady Zahed, Mohammed Tareq, Abdelrahman Teskyeh, Obada, and many others.

My graduate study has been an enjoyable journey with my lovely wife, Nour Almadhoun, and my sons, Hasan and Adam. Their love and never-ending support always give me the internal strength to move on whenever I feel like giving up or whenever things become too hard. Pursuing a PhD together with Nour in the same department is our most precious memory in a lifetime. We could not have completed any of our PhD work without continuously supporting and encouraging each other.

My family has been a pillar of support, encouragement, and comfort all throughout my journey. Thanks to my parents, Hasan Alser and Itaf Abdelhadi, who raised me with a love of science and supported me in all my pursuits. Thanks to my lovely sister Deema and my brothers Ayman (and his lovely family), Hani, Sohaib, Osaid, Moath, and Ayham for always believing in me and being by my side. Thanks to my parents-in-law, Mohammed Almadhoun and Nariman Almadhoun, my brothers-in-law, Ahmed Nemer, Ezz Eldeen, Moatasem, and Mahmoud, my sisters-in-law, Narmeen, Rania, and Eman, and their families for all their understanding, great support, and love.

Finally, I gratefully acknowledge the funding sources that made my PhD work possible. I was honored to be a TÜBİTAK Fellow for 4 years, offered by the Scientific and Technological Research Council of Turkey under the 2215 program. In 2017, I was also honored to receive the HiPEAC Collaboration Grant. I am grateful to Bilkent University for providing generous financial support and funding my academic visits. This thesis was supported by NIH Grant (No. HG006004) to Onur Mutlu and Can Alkan and a Marie Curie Career Integration Grant (No. PCIG-2011-303772) to Can Alkan under the Seventh Framework Programme.
Mohammed H. K. Alser
June 2018, Ankara, Turkey

Contents
3.3 Discussion on Improving the Filtering Accuracy 34
3.4 Summary 35
7.2.1 Method 1: Building the Neighborhood Map 66
7.2.2 Method 2: Identifying the E+1 Identical Subsequences 66
7.2.3 Method 3: Filtering Out Dissimilar Sequences 68
7.3 Analysis of MAGNET Algorithm 69
7.4 Discussion 71
7.5 Summary 72
10 Conclusions and Future Directions 118
11 Other Works of This Author 122
A Data 141

List of Figures
2.1 Timeline of read mappers. CPU-based mappers are plotted in black, GPU-accelerated mappers in red, FPGA-accelerated mappers in blue, and SSE-based mappers in green. Grey dotted lines connect related mappers (extensions or new versions). The names in the timeline are exactly as they appear in publications, except: SOAP3-FPGA [103], BWA-MEM-FPGA [55], BFAST-Olson [104], BFAST-Yus [105], BWA-Waidyasooriya [58], and BWA-W [68]. 18

3.1 An example of a mapping with all its generated masks, where the edit distance threshold (E) is set to 3. The green highlighted subsequences are part of the correct alignment. The red highlighted bit in the final bit-vector is a wrong alignment provided by SHD. The correct alignment (highlighted in yellow) shows that there are three substitutions and a single deletion, while SHD detects only two substitutions and a single deletion. 29

3.2 Examples of an incorrect mapping that passes the SHD filter due to the random zeros. While the edit distance threshold is 3, a mapping of 4 edits (as examined at the end of the figure by the Needleman-Wunsch algorithm) passes as a falsely accepted mapping. 30

3.3 An example of an incorrect mapping that passes the SHD filter due to conservative counting of the short streak of '1's in the final bit-vector. 31

3.4 Examples of an invalid mapping that passes the SHD filter due to the leading and trailing zeros. We use an edit distance threshold of 3 and an SRS threshold of 2. While the regions that are highlighted in green are part of the correct alignment, the wrong alignment provided by SHD is highlighted in red. The yellow highlighted bits indicate a source of falsely accepted mapping. 32

3.5 Examples of incorrect mappings that pass the SHD filter due to (a) overlapping identical subsequences, and (b) lack of backtracking. 33

4.1 Overview of our accelerator architecture. 38

4.2 Xilinx Virtex-7 FPGA VC709 Connectivity Kit and chip layout of our implemented accelerator. 39

5.1 A flowchart representation of the GateKeeper algorithm. 44

5.2 An example showing how various types of edits affect the alignment of two reads. In (a) the upper read exactly matches the lower read and thus each base exactly matches the corresponding base in the target read. (b) shows base-substitutions that only affect the alignment at their positions. (c) and (d) demonstrate insertions and deletions, respectively. Each edit has an influence on the alignment of all the subsequent bases. 45

5.3 Workflow of the proposed architecture for the parallel amendment operations. 46

5.4 An example of applying our solution for reducing the number of bits of each Hamming mask by half. We use a modified Hamming mask to store the result of applying the bitwise OR operation to each two bits of the Hamming mask. The modified mask maintains the same meaning of the original Hamming mask. 48

6.1 Random edit distribution in a read sequence. The edits (e1, e2, ..., eE) act as dividers resulting in several identical subsequences (m1, m2, ..., mE+1) between the read and the reference. 53

6.2 Neighborhood map (N) and the Shouji bit-vector, for text T = GGTGCAGAGCTC and pattern P = GGTGAGAGTTGT for E = 3. The three common subsequences (i.e., GGTG, AGAG, and T) are highlighted in yellow. We use a search window of size 4 columns (two examples of which are highlighted in red) with a step size of a single column. Shouji searches diagonally within each search window for the 4-bit vector that has the largest number of zeros. Once found, Shouji examines if the found 4-bit vector maximizes the number of zeros at the corresponding location of the 4-bit vector in the Shouji bit-vector. If so, then Shouji stores this 4-bit vector in the Shouji bit-vector at its corresponding location. 55

6.3 The effect of the window size on the rate of the falsely accepted sequences (i.e., dissimilar sequences that are considered as similar ones by the Shouji filter). We observe that a window width of 4 columns provides the highest accuracy. We also observe that as the window size increases beyond 4 columns, more similar sequences are rejected by Shouji, which should be avoided. 58

6.4 An example of applying the Shouji filtering algorithm to a sequence pair, where the edit distance threshold is set to 4. We present the content of the neighborhood map along with the Shouji bit-vector. We apply the Shouji algorithm starting from the leftmost column towards the rightmost column. 59

6.5 A block diagram of the sliding window scheme implemented in FPGA for a single filtering unit. 62

7.1 The false accept rate of MAGNET using two different algorithms for identifying the identical subsequences. We observe that finding the E+1 non-overlapping identical subsequences leads to a significant reduction in the incorrectly accepted sequences compared to finding the top E+1 longest identical subsequences. 68

7.2 An example of applying the MAGNET filtering algorithm with an edit distance threshold of 4. MAGNET finds all the longest non-overlapping subsequences of consecutive zeros in the descending order of their length (as numbered in yellow). 69

8.1 Weighted neighborhood maze (WN), for reference sequence A = GGTGGAGAGATC and read sequence B = GGTGAGAGTTGT for E = 3. The snake traverses the path that is highlighted in yellow, which includes three obstacles and three pipes that represent three identical subsequences: GGTG, AGAG, and T. 78

8.2 Unweighted neighborhood maze (UN), for reference sequence A = GGTGGAGAGATC and read sequence B = GGTGAGAGTTGT for E = 3. The snake traverses the path that is highlighted in yellow, which includes three obstacles and three pipes that represent three identical subsequences: GGTG, AGAG, and T. 80

8.3 Proposed 4-bit LZC comparator. 85

8.4 Block diagram of the 11 LZCs and the hierarchical LZC comparator tree for computing the largest number of leading zeros in 11 rows. 86

9.1 The effects of column-wise partitioning of the search space of the SneakySnake algorithm on the average false accept rate ((a), (c), and (e)) and the average execution time ((b), (d), and (f)) of examining set 1 to set 4 in (a) and (b), set 5 to set 8 in (c) and (d), and set 9 to set 12 in (e) and (f). Besides the default size (equal to the read length) of SneakySnake's unweighted neighborhood maze, we choose partition sizes (the number of grid columns that are included in each partition) of 5, 10, 25, and 50 columns. 96

9.2 The effects of the number of replications of the hardware implementation of the SneakySnake algorithm on its filtering accuracy (false accept rate). We use a wide range of edit distance thresholds (0 bp-10 bp for a read length of 100 bp) and four datasets: (a) set 1, (b) set 2, (c) set 3, and (d) set 4. 97

9.3 The effects of computing the prefix edit distance on the overall execution time of the edit distance calculations compared to the original Edlib (exact edit distance) and our partitioned implementation of the SneakySnake algorithm. We present the average time spent in examining set 1 to set 4 in (a), set 5 to set 8 in (b), and set 9 to set 12 in (c). We choose initial sub-matrix sizes of 5, 10, 25, and 50 columns. We mark the intersection of the SneakySnake-5 and Edlib-50 plots with a dashed vertical line. 99

9.4 The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-10 edits for a read length of 100 bp) and four datasets: (a) set 1, (b) set 2, (c) set 3, and (d) set 4. 101
9.5 The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-15 edits for a read length of 150 bp) and four datasets: (a) set 5, (b) set 6, (c) set 7, and (d) set 8. 102

9.6 The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-25 edits for a read length of 250 bp) and four datasets: (a) set 9, (b) set 10, (c) set 11, and (d) set 12. 103

9.7 An example of a falsely-rejected mapping using the MAGNET algorithm for an edit distance threshold of 6. The random zeros (highlighted in red) confuse the MAGNET filter, causing it to select shorter segments of random zeros instead of a longer identical subsequence (highlighted in blue). 104

9.8 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm. 110

9.9 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm. 111

9.10 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm. 112
9.11 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds. 113

9.12 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds. 114

9.13 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds. 115

9.14 Memory utilization of the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25. 116

9.15 Memory utilization of the exact edit distance algorithm (Edlib ED) combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25. 116

9.16 Memory utilization of SneakySnake-5 combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25. 117

List of Tables

(-e parameter). 91

9.3 FPGA resource usage for a single filtering unit of GateKeeper, Shouji, MAGNET, and SneakySnake for a sequence length of 100 and under different edit distance thresholds (E). 93

9.4 The execution time (in seconds) of GateKeeper, MAGNET, Shouji, and SneakySnake under different edit distance thresholds. We use set 1 to set 4 with a read length of 100. We provide the performance results for the CPU implementations and the hardware accelerators with the maximum number of filtering units. 106

9.5 End-to-end execution time (in seconds) for several state-of-the-art sequence alignment algorithms, with and without pre-alignment filters (SneakySnake, Shouji, MAGNET, GateKeeper, and SHD) and across different edit distance thresholds. We use four datasets (set 1, set 2, set 3, and set 4). 108

A.1 Details of our first four datasets (set 1, set 2, set 3, and set 4). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 10 edits. 142

A.2 Details of our second four datasets (set 5, set 6, set 7, and set 8). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 15 edits. 143

A.3 Details of our last four datasets (set 9, set 10, set 11, and set 12). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 25 edits. 143

A.4 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using set 1, set 2, set 3, and set 4 datasets. 144

A.5 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using set 5, set 6, set 7, and set 8 datasets. 144

A.6 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using set 9, set 10, set 11, and set 12 datasets. 145

Chapter 1

Introduction
The genome is the code of life: it includes the set of instructions for making everything from humans to elephants, bananas, and yeast. Analyzing life's code helps, for example, to determine differences in genomes from human to human that are passed from one generation to the next and may cause diseases or different traits [1, 2, 3, 4, 5, 6]. One benefit of knowing the genetic variations is a better understanding and diagnosis of diseases such as cancer and autism [7, 8, 9, 10] and the development of efficient drugs [11, 12, 13, 14, 15]. The first step in genome analysis is to reveal the entire content of the subject genome, a process known as DNA sequencing [16]. To this day, it remains challenging to sequence the entire DNA molecule as a whole. As a workaround, high throughput DNA sequencing (HTS) technologies are used to sequence random fragments of copies of the original molecule. These fragments are called short reads and are 75-300 basepairs (bp) long. The resulting reads lack information about their order and origin (i.e., which part of the subject genome they originated from). Hence the main challenge in genome analysis is to construct the donor's complete genome with respect to a reference genome.

During a process called read mapping, each read is mapped onto one or more possible locations in the reference genome based on the similarity between the read and the reference sequence segment at that location (like solving a jigsaw puzzle). The similarity measurement is referred to as optimal read alignment (i.e., verification) and can be calculated using the Smith-Waterman local alignment algorithm [17]. It calculates the alignment, which is an ordered list of characters representing the possible edit operations and matches required to change one of the two given sequences into the other. Commonly allowed edit operations include deletion, insertion, and substitution of characters in one or both sequences.
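To make the notion of an alignment as an ordered list of edit operations concrete, here is a minimal illustrative sketch (a plain Levenshtein-distance variant in Python, not the thesis's Smith-Waterman implementation; the function name `align` is our own): it fills the classic dynamic-programming table and backtracks to recover one optimal sequence of matches, substitutions, insertions, and deletions.

```python
def align(a, b):
    """Levenshtein DP with traceback: returns (edit distance, edit script).

    The edit script is an ordered list of operations ('match', 'substitute',
    'insert', 'delete') that transforms string `a` into string `b`.
    """
    m, n = len(a), len(b)
    # D[i][j] = edit distance between the prefixes a[:i] and b[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,  # match or substitute
                          D[i - 1][j] + 1,         # delete from a
                          D[i][j - 1] + 1)         # insert into a
    # Backtrack from D[m][n] to recover one optimal alignment.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append('match' if a[i - 1] == b[j - 1] else 'substitute')
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append('delete')
            i -= 1
        else:
            ops.append('insert')
            j -= 1
    return D[m][n], ops[::-1]
```

For example, `align("GATTACA", "GATCACA")` reports a distance of 1 with a single substitution in the edit script. Both the table fill and the traceback touch O(mn) cells, which is exactly the quadratic cost discussed below.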
Asany two sequences can have several different arrangements of the edit operationsand matches (and hence different alignments), the alignment algorithm usuallyinvolves a backtracking step. This step finds the alignment that has the highestalignment score (called optimal alignment). The alignment score is the sumof the scores of all edits and matches along the alignment implied by a user-defined scoring function. However, this approach is infeasible as it requires
O(mn) running time, where m is the read length (a few hundred bp) and n is the reference length (about 3 billion bp for the human genome). Mappers therefore typically follow a strategy called seed-and-extend. In this strategy, a mapper applies heuristics to first find candidate map locations (seed locations) of subsequences of the reads using hash tables (BitMapper [18], mrFAST with FastHASH [19], mrsFAST [20]) or BWT-FM indices (BWA-MEM [21], Bowtie 2 [22], SOAP3-dp [23]). It then aligns the read in full only to those seed locations. Although the strategies for finding seed locations vary among different read mapping algorithms, seed location identification is typically followed by an alignment step. The general goal of this step is to compare the read to the reference segment at the seed location to check if the read aligns to that location in the genome with fewer differences (called edits) than a threshold [24].

Figure 1.1: Rate of verified and unverified read-reference pairs that are generated by the mrFAST mapper and are fed into its read alignment algorithm. We set the threshold to 2, 3, and 5 edits.

The alignment step is the performance bottleneck of today's read mappers, taking over 70% to 90% of the total running time [25, 18, 19]. We pinpoint three specific problems that cause, affect, or exacerbate the long execution time of alignment.

(1) We find that across our data set (see
Chapter 9), an overwhelming majority (more than 90%, as we present in Figure 1.1) of the seed locations that are generated by a state-of-the-art read mapper, mrFAST with FastHASH [19], exhibit more edits than the allowed threshold. These particular seed locations impose a large computational burden, as verifying these incorrect mappings wastes 90% of the alignment step's execution time. This observation is also in line with similar results for other read mappers [19, 25, 18, 26].

(2) A typical read alignment algorithm also needs to tolerate sequencing errors [27, 28] as well as genetic variations [29]. Read alignment is a non-additive measure [30]. This means that if we divide a sequence pair into two consecutive subsequence pairs, the edit distance of the entire sequence pair is not necessarily equivalent to the sum of the edit distances of the shorter pairs. Instead, we need to examine all possible prefixes of the two input sequences and keep track of the pairs of prefixes that provide an optimal solution. Therefore, the read alignment step is implemented using dynamic programming algorithms to avoid re-examining the same prefixes many times. These include Levenshtein distance [31], Smith-Waterman [17], Needleman-Wunsch [32], and their improved implementations. These algorithms are inefficient as they run in quadratic time in the read length m (i.e., O(m²)).

(3) This computational burden is further aggravated by the unprecedented flood of sequencing data, which continues to overwhelm the processing capacity of existing algorithms and compute infrastructures [33]. While today's HTS machines (e.g., Illumina HiSeq4000) can generate more than 300 million bases per minute, a state-of-the-art read mapper can only map 1% of these bases per minute [34].
The situation gets even worse when one tries to understand a complex disease (e.g., autism and cancer) [8, 9, 35, 36, 37, 38, 39] or profile a metagenomics sample [40, 41, 42, 43], which requires sequencing hundreds of thousands of genomes. The long execution time of modern-day read alignment can severely hinder such studies. There is also an urgent need for rapidly incorporating clinical sequencing into clinical practice for the diagnosis of genetic disorders in critically ill infants [44, 45, 46, 47]. While early diagnosis in such infants shortens the clinical course and enables optimal outcomes [48, 49, 50], it is still challenging to deliver efficient clinical sequencing for the tens to hundreds of thousands of hospitalized infants each year [51].

Tackling these challenges and bridging the widening gap between the execution time of read alignment and the increasing amount of sequencing data necessitate the development of fundamentally new, fast, and efficient read alignment algorithms. In the next sections, we provide the motivation behind our proposed work to considerably boost the performance of read alignment. We also provide further background information and a literature study in Chapter 2.

1.2 Motivation

We present in Figure 1.2 the flow chart of a typical seed-and-extend based mapper during the mapping stage. The mapper follows five main steps to map a read set to the reference genome sequence. (1) In step 1, a mapper typically first constructs a fast indexing data structure (e.g., a large hash table) for short segments (called seeds, k-mers, or q-grams) of the reference sequence. (2) In step 2, the mapper extracts short subsequences from a read and uses them to query the hash table. (3) In step 3, the hash table returns all the occurrence hits of each seed in the reference genome. Modern mappers employ a seed filtering mechanism to reduce the false seed locations that lead to incorrect mappings.
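The indexing and querying in steps 1-3 can be sketched as follows. This is a toy Python illustration with hypothetical function names; real mappers use far more compact index structures than a plain dictionary.

```python
from collections import defaultdict

def build_seed_index(reference, k):
    """Step 1: map every length-k seed of the reference to its locations."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_locations(read, index, k):
    """Steps 2-3: query non-overlapping seeds of the read; each exact hit
    suggests one candidate location for the entire read in the reference."""
    locations = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            if pos - offset >= 0:      # align the read start to the hit
                locations.add(pos - offset)
    return sorted(locations)
```

Each candidate location returned here is what steps 4 and 5 must subsequently examine, which is why pruning this list early is so valuable.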
(4) In step 4, for each possible location in the list, the mapper retrieves the corresponding reference segment from the reference genome based on the seed location. The mapper can then examine the alignment of the entire read with the reference segment using fast filtering heuristics that reduce the need for the dynamic programming algorithms. It rejects the mapping if the read and the reference are obviously dissimilar. Otherwise, the mapper proceeds to the next step. (5) In step 5, the mapper calculates the optimal alignment between the read sequence and the reference sequence using a computationally costly sequence alignment (i.e., verification) algorithm to determine the similarity between the two sequences. Many attempts were made to tackle the computationally very expensive alignment problem. Most existing works tend to follow one of three key directions: (1) accelerating the dynamic programming algorithms [52, 53, 54, 55, 56, 57, 58, 23, 59, 60], (2) developing seed filters that aim to reduce the false seed locations [18, 19, 26, 61, 62, 63, 64], and (3) developing pre-alignment filtering heuristics [65].

The first direction takes advantage of the parallelism capabilities of high performance computing platforms such as central processing units (CPUs) [54, 56], graphics processing units (GPUs) [57, 23, 59], and field-programmable gate arrays (FPGAs) [52, 53, 66, 67, 55, 58, 60]. Among these computing platforms, FPGA accelerators seem to yield the highest performance gain [53, 68]. However, many of these efforts either simplify the scoring function or only accelerate the computation of the dynamic programming matrix without providing the optimal alignment (i.e., backtracking), as in [66, 67]. Different scoring functions are typically needed to better quantify the similarity between the read and the reference sequence segment [69].
The backtracking step required for optimal alignment computation involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation.

The second direction to accelerate read alignment is to use filtering heuristics to reduce the size of the seed location list. This is the basic principle of nearly all mappers that employ the seed-and-extend approach. A seed filter applies heuristics to reduce the output location list, which stores all the occurrence locations of each seed in the reference genome. The returned location list can be tremendously large, as a mapper searches for exact matches of short segments (typically 10 bp to 13 bp for hash-based mappers) between two very long homologous genomes [70]. Filters in this category suffer from low filtering accuracy as they can only look for exact matches with the help of a hash table. Thus, they query a small number of seeds per read (e.g., in Bowtie 2 [22], three fixed-length seeds at fixed locations) to maintain edit tolerance. mrFAST [19] uses another approach to increase the seed filtering accuracy by querying the seed and its shifted copies. This idea is based on the observation that indels cause the trailing characters to be shifted in one direction. If one of the shifted copies of the seed, generated from the read sequence, or the seed itself matches the corresponding seed from the reference, then this seed has zero edits. Otherwise, this approach counts the number of edits in this seed as a single edit (which can be a single indel or a single substitution). Therefore, this approach fails to detect the correct number of edits in some cases, for example, more than one substitution in the same seed, a substitution and an indel in the same seed, or more than one indel in the last seed.
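The shifted-seed idea described above can be sketched as follows. The helper name and interface are hypothetical (this is not the mrFAST source); the sketch only illustrates why the estimate saturates at one edit per seed.

```python
def seed_edit_estimate(read_seed_region, ref_seed, max_shift=1):
    """Sketch of the shifted-seed heuristic: if the seed, or one of its
    copies shifted by up to max_shift characters, matches the reference
    seed exactly, estimate zero edits; otherwise estimate a single edit.
    read_seed_region must include max_shift flanking characters on each
    side of the seed's expected position."""
    k = len(ref_seed)
    for shift in range(-max_shift, max_shift + 1):
        start = max_shift + shift
        if read_seed_region[start:start + k] == ref_seed:
            return 0       # exact match at some shift: assume no edits
    return 1               # otherwise count the whole seed as one edit
```

A seed containing two substitutions still returns 1 here, which is exactly the undercounting weakness discussed above.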
Seed filtering succeeds in eliminating some of the incorrect locations, but it is still unable to eliminate a sufficiently large portion of the false seed locations, as we present in Figure 1.1.
Figure 1.2: The flow chart of seed-and-extend based mappers, which includes five main steps. (1) A typical mapper starts with indexing the reference genome using a scheme based on a hash table or BWT-FM. (2) It extracts a number of seeds from each read produced by the sequencing machine. (3) It obtains a list of possible locations within the reference genome that could result in a match with the extracted seeds. (4) It applies fast heuristics to examine the alignment of the entire read with its corresponding reference segment of the same length. (5) It uses an expensive and accurate alignment algorithm to determine the similarity between the read and the reference sequences.
Figure 1.3: End-to-end execution time (in seconds) for read alignment, with (blue plot) and without (green plot) a pre-alignment filter. Our goal is to significantly reduce the alignment time spent on verifying incorrect mappings (highlighted in yellow). We sweep the percentage of rejected mappings and the filtering speed compared to the alignment algorithm on the x-axis.

The third direction to accelerate read alignment is to minimize the number of incorrect mappings on which alignment is performed by incorporating filtering heuristics. This is the last line of defense before invoking computationally expensive read alignment. Such filters come into play before read alignment (hence the name pre-alignment filter), discarding incorrect mappings that alignment would deem a poor match. Though seed filtering and pre-alignment filtering have the same goal, they are fundamentally different problems. In the pre-alignment filtering approach, a filter needs to examine the entire mapping. It calculates a best-guess lower bound on the number of edits between a read sequence and a reference segment. If the lower bound exceeds a certain number of edits, indicating that the read and the reference segment do not align, the mapping is eliminated so that no alignment is performed. Unfortunately, the best performing existing pre-alignment filter, shifted Hamming distance (SHD), is slow, and its mechanism unnecessarily introduces inaccuracy into its filtering, as we show in our study in Chapter 3 and in our experimental evaluation in Chapter 9.

A pre-alignment filter enables the acceleration of read alignment and, at the same time, offers the ability to make the best use of existing read alignment algorithms. These benefits come without sacrificing any capabilities of these algorithms, as a pre-alignment filter does not modify or replace the alignment step. This motivates us to focus our improvement and acceleration efforts on pre-alignment filtering.
We analyze in Figure 1.3 the effect of adding a pre-alignment filtering step after generating the seed locations and before calculating the optimal alignment. We make two key observations. (1) The reduction in the end-to-end processing time of the alignment step largely depends on the accuracy and the speed of the pre-alignment filter. (2) Pre-alignment filtering can provide unsatisfactory performance (as highlighted in red) if it cannot reject more than about 30% of the potential mappings while it is only 2x-4x faster than the read alignment step.

We conclude that it is important to understand what makes a pre-alignment filter inefficient, so that we can devise a new filtering technique that is much faster than read alignment and yet maintains high filtering accuracy.
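Conceptually, a pre-alignment filter slots into the mapper as a cheap lower-bound check before the dynamic programming step. The following sketch (with hypothetical function names, not the thesis's code) shows this control flow; the filter and aligner are passed in as callables.

```python
def map_read(read, ref_segments, threshold, prealign_filter, align):
    """Run the expensive aligner only on mappings the filter accepts.
    prealign_filter returns a quick lower-bound estimate of the edit
    distance; mappings whose estimate exceeds the threshold are rejected
    without ever invoking the expensive alignment algorithm."""
    results = []
    for segment in ref_segments:
        if prealign_filter(read, segment) > threshold:
            continue                    # reject: cheap, no alignment done
        score = align(read, segment)    # expensive dynamic programming
        if score <= threshold:
            results.append((segment, score))
    return results
```

The end-to-end benefit then depends on exactly the two quantities swept in Figure 1.3: how often the filter rejects, and how much cheaper `prealign_filter` is than `align`.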
Our goal in this thesis is to significantly reduce the time spent on calculating the optimal alignment in genome analysis from hours to mere seconds, given limited computational resources (i.e., a personal computer or small hardware). This would make it feasible to analyze DNA routinely in the clinic for personalized health applications. Towards this end, we analyze the mappings that are provided to the read alignment algorithm and explore the causes of filtering inaccuracy. Our thesis statement is:

Read alignment can be substantially accelerated using computationally inexpensive and accurate pre-alignment filtering algorithms designed for specialized hardware.
An accurate filter designed on a specialized hardware platform can drastically expedite read alignment by reducing the number of locations that must be verified via dynamic programming. To this end, we (1) develop four hardware-acceleration-friendly filtering algorithms and highly-parallel hardware accelerator designs that greatly reduce the need for alignment verification in DNA read mapping, (2) introduce a fast and accurate pre-alignment filter for general-purpose processors, and (3) develop a better understanding of filtering inaccuracy and explore speed/accuracy trade-offs.
The overarching contribution of this thesis is the new algorithms and architectures that reduce read alignment's execution time in read mapping. More specifically, this thesis makes the following main contributions:

1. We provide a detailed investigation and analysis of four potential causes of filtering inaccuracy in the state-of-the-art alignment filter, SHD [65]. We also provide our recommendations for eliminating these causes and improving the overall filtering accuracy.

2.
GateKeeper.
We introduce GateKeeper, the first hardware pre-alignment filter, which substantially reduces the need for alignment verification in DNA read mapping. GateKeeper is highly parallel and relies heavily on bitwise operations such as look-up table, shift, XOR, and AND operations. GateKeeper can examine up to 16 mappings in parallel on a single FPGA chip, with a logic utilization of less than 1% for a single filtering unit. It provides two orders of magnitude speedup over the state-of-the-art pre-alignment filter, SHD. It also provides up to 13.9x speedup to state-of-the-art aligners. GateKeeper is published in Bioinformatics [71] and is also available on arXiv [72].

3.
Shouji.
We introduce Shouji, a highly accurate and parallel pre-alignment filter that uses a sliding window approach to quickly identify dissimilar sequences without the need for computationally expensive alignment algorithms. Shouji can examine up to 16 mappings in parallel on a single FPGA chip, with a logic utilization of up to 2% for a single filtering unit. Due to its high accuracy, it provides, on average, 1.2x to 1.4x more speedup to state-of-the-art read aligners than GateKeeper provides. Shouji is 2.9x to 155x more accurate than GateKeeper. Shouji is published in Bioinformatics [73] and is also available on arXiv [74].

4.
MAGNET.
We introduce MAGNET, a highly accurate pre-alignment filter that employs a greedy divide-and-conquer approach to identify all non-overlapping long matches between two sequences. MAGNET can examine 2 or 8 mappings in parallel, depending on the edit distance threshold, on a single FPGA chip, with a logic utilization of up to 37.8% for a single filtering unit. MAGNET is, on average, two to four orders of magnitude more accurate than both Shouji and GateKeeper. This comes at the expense of its filtering speed, as it is up to 8x slower than Shouji and GateKeeper. It still provides up to 16.6x speedup to state-of-the-art read aligners. MAGNET is published in IPSI [75], available on arXiv [76], and was presented at AACBB 2018 [77].

5.
SneakySnake.
We introduce SneakySnake, the fastest and most accurate pre-alignment filter. SneakySnake reduces the problem of finding the optimal alignment to finding a snake's optimal path (the one with the fewest obstacles) in time linear in the read length. We provide a cost-effective CPU implementation of our SneakySnake algorithm that accelerates the state-of-the-art read aligners Edlib [78] and Parasail [54] by up to 43x and 57.9x, respectively, without the need for hardware accelerators. We also provide a scalable hardware architecture and hardware design optimizations for the SneakySnake algorithm to further boost its speed. The hardware implementation of SneakySnake accelerates the existing state-of-the-art aligners by up to 21.5x when it is combined with the aligner. SneakySnake is up to one order, four orders, and five orders of magnitude more accurate than MAGNET, Shouji, and GateKeeper, respectively, while preserving all correct mappings. SneakySnake also reduces the memory footprint of the Edlib aligner by 50%. This work is yet to be published.

6. We provide a comprehensive analysis of the asymptotic run time and space complexity of our four pre-alignment filtering algorithms. We perform a detailed experimental evaluation of our proposed algorithms using 12 real datasets across three different read lengths (100 bp, 150 bp, and 250 bp) and edit distance thresholds of 0% to 10% of the read length. We explore different implementations of the edit distance problem in order to compare the performance of SneakySnake, which calculates an approximate edit distance, with that of an efficient implementation of exact edit distance. This also helps us develop a deep understanding of the trade-off between the accuracy and speed of pre-alignment filtering.

Overall, we show in this thesis that developing a hardware-based alignment filtering algorithm and architecture together is both feasible and effective, by building our hardware accelerator on a modern FPGA system.
We also demonstrate that our pre-alignment filters are more effective in boosting the overall performance of the alignment step than merely accelerating the dynamic programming algorithms by one to two orders of magnitude.

This thesis provides a foundation for developing fast and accurate pre-alignment filters for accelerating existing and future read mappers.
This thesis is organized into 11 chapters. Chapter 2 describes the necessary background on read mappers and related prior work on accelerating their computations. Chapter 3 explores the potential causes of filtering inaccuracy and provides recommendations for tackling them. Chapter 4 presents the architecture and implementation details of the hardware accelerator that we use to boost the speed of our proposed pre-alignment filters. Chapter 5 presents the GateKeeper algorithm and architecture. Chapter 6 presents the Shouji algorithm and its hardware architecture. Chapter 7 presents the MAGNET algorithm and its hardware architecture. Chapter 8 presents the SneakySnake algorithm. Chapter 9 presents the detailed experimental evaluation of all our proposed pre-alignment filters, along with a comprehensive comparison with the state-of-the-art existing works. Chapter 10 presents conclusions and future research directions that are enabled by this thesis. Finally, Appendix A extends Chapter 9 with more detailed information and additional experimental data/results.

Chapter 2

Background
In this chapter, we provide the necessary background on two key read mapping methods. We highlight the strengths and weaknesses of each method. We then provide an extensive literature review of prior, existing, and recent approaches for accelerating the operations of read mappers. We restrict the background material to approaches that reduce a read mapper's execution time.
With the presence of a reference genome, read mappers maintain a large index database. mrFAST (> version 2.5) [19] first builds a hash table to index fixed-length seeds (typically 10-13 bp) from the reference genome. It then applies a seed location filtering mechanism, called the Adjacency Filter, on the hash table to reduce the false seed locations. It divides each query read into smaller fixed-length seeds to query the hash table for their associated seed locations. Given an edit distance threshold, the Adjacency Filter requires N-E seeds to exactly match at adjacent locations, where N is the number of seeds and E is the edit distance threshold. Finally, mrFAST tries to extend the read at each of the seed locations by aligning the read to the reference fragment at the seed location via Levenshtein edit distance [31] with Ukkonen's banded algorithm [101]. One drawback of this seed filtering is that the presence of one or more substitutions in any seed is counted by the Adjacency Filter as a single mismatch. The effectiveness of the Adjacency Filter for substitutions and indels diminishes when E becomes larger than 3 edits.

A recent work [102] shows that by removing redundancies in the reference genome and also across the reads, seed-and-extend mappers can be faster than BWT-FM-based mappers. This space-efficient approach uses a similar idea to the one presented in [94]. A hybrid method that incorporates the advantages of each approach can also be utilized, such as BWA-MEM [21].

A majority of read mappers are developed for machines equipped with general-purpose central processing units (CPUs). As the gap between CPU computing speed and the very large amount of sequencing data widens, CPU-based mappers become less favorable due to their limitations in accessing data [52, 53, 54, 55, 56, 57, 58, 23, 59, 60]. To tackle this challenge, many attempts were made to accelerate the operations of read mapping.
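Based on the description above, the Adjacency Filter's pigeonhole test can be sketched as follows. This is a simplified Python illustration that checks substitutions only; the actual filter also examines shifted adjacent positions to tolerate indels, and the function names here are hypothetical.

```python
def adjacency_filter(read, reference, location, num_seeds, threshold):
    """Pigeonhole-style check in the spirit of the Adjacency Filter:
    with at most `threshold` (E) edits, at least N - E of the N seeds
    must still match exactly at their expected adjacent positions."""
    seed_len = len(read) // num_seeds
    matches = 0
    for i in range(num_seeds):
        start = i * seed_len
        seed = read[start:start + seed_len]
        ref_pos = location + start
        if reference[ref_pos:ref_pos + seed_len] == seed:
            matches += 1
    return matches >= num_seeds - threshold  # pass -> verify via alignment
```

As the text notes, any number of substitutions inside one seed destroys only that one seed's exact match, so the filter counts them as a single edit.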
We survey in Figure 2.1 the existing read mappers implemented on various acceleration platforms. FPGA-based read mappers often demonstrate one to two orders of magnitude speedup over their GPU-based counterparts [53, 68]. Most of the existing works use hardware platforms only to accelerate the dynamic programming algorithms (e.g., the Smith-Waterman algorithm [17]), as these algorithms contribute significantly to the overall running time of read mappers. Most existing works can be divided into three main approaches: (1) developing seed filtering mechanisms to reduce the seed location list, (2) accelerating the computationally expensive alignment algorithms using algorithmic development or hardware accelerators, and (3) developing pre-alignment filtering heuristics to reduce the number of incorrect mappings before they are examined by read alignment. We next describe each of these three acceleration efforts in detail.
The first approach to accelerate today's read mappers is to filter the seed location list before performing read alignment.
Figure 2.1: Timeline of read mappers. CPU-based mappers are plotted in black, GPU-accelerated mappers in red, FPGA-accelerated mappers in blue, and SSE-based mappers in green. Grey dotted lines connect related mappers (extensions or new versions). The names in the timeline are exactly as they appear in publications, except: SOAP3-FPGA [103], BWA-MEM-FPGA [55], BFAST-Olson [104], BFAST-Yus [105], BWA-Waidyasooriya [58], and BWA-W [68].

This is the basic principle of nearly all seed-and-extend mappers. Seed filtering is based on the observation that if two sequences are potentially similar, then they share a certain number of seeds. Seeds (sometimes called q-grams or k-mers) are short subsequences that are used as indices into the reference genome to reduce the search space and speed up the mapping process. Modern mappers extract short subsequences from each read and use them as keys to query the previously built large reference index database. The database returns a location list for each seed, which stores all the occurrence locations of that seed in the reference genome. The mapper then examines the optimal alignment between the read and the reference segment at each of these seed locations. The performance and accuracy of seed-and-extend mappers depend on how the seeds are selected in the first stage.
Mappers should select a large number of non-overlapping seeds while keeping each seed as infrequent as possible for full sensitivity [70, 106, 19]. There is also a significant advantage to selecting seeds of unequal lengths, as possible seeds of equal lengths can have drastically different frequencies. Finding the optimal set of seeds from read sequences is challenging and complex, primarily because the associated search space is large and grows exponentially as the number of seeds increases. There are other variants of seed filtering based on the pigeonhole principle [18, 61], non-overlapping seeds [19], gapped seeds [107, 63], variable-length seeds [70], random permutation of subsequences [108], or full permutation of all possible subsequences [25, 109, 110].
The second approach to boost the performance of read mappers is to accelerate the read alignment step. One of the most fundamental computational steps in most bioinformatics analyses is the detection of the differences/similarities between two genomic sequences. Edit distance and pairwise alignment are two approaches to achieve this step, formulated as approximate string matching [24]. The edit distance approach is a measure of how much the sequences differ. It calculates the minimum number of edits needed to convert one sequence into the other. The higher the distance, the more the sequences differ from one another. Commonly allowed edit operations include deletion, insertion, and substitution of characters in one or both sequences. Pairwise alignment is a way to identify regions of high similarity between sequences. Each application employs a different edit model (called a scoring function), which is then used to generate an alignment score. The latter is a measure of how much the sequences are alike. Any two sequences have a single edit distance value, but they can have several different alignments (i.e., ordered lists of possible edit operations and matches) with different alignment scores. Thus, alignment algorithms usually involve a backtracking step to provide the optimal alignment (i.e., the best arrangement of the possible edit operations and matches), which has the highest alignment score. Depending on the demand, pairwise alignment can be performed as global alignment, where two sequences of the same length are aligned end-to-end, or local alignment, where subsequences of the two given sequences are aligned. It can also be performed as semi-global alignment (called glocal), where the entirety of one sequence is aligned towards one of the ends of the other sequence.

The edit distance and pairwise alignment approaches are non-additive measures [30].
This means that if we divide a sequence pair into two consecutive subsequence pairs, the edit distance of the entire sequence pair is not necessarily equivalent to the sum of the edit distances of the shorter pairs. Instead, we need to examine all possible prefixes of the two input sequences and keep track of the pairs of prefixes that provide an optimal solution. Enumerating all possible prefixes is necessary for tolerating edits that result from both sequencing errors [28] and genetic variations [29]. Therefore, these measures are typically implemented as dynamic programming algorithms to avoid re-computing the edit distance of the prefixes many times. These implementations, such as Levenshtein distance [31], Smith-Waterman [17], and Needleman-Wunsch [32], are still inefficient as they have quadratic time and space complexity (i.e., O(m²) for a sequence length of m). Many attempts were made to boost the performance of existing sequence aligners. Despite more than three decades of attempts, the fastest known edit distance algorithm [111] has a running time of O(m²/log m) for sequences of length m, which is still nearly quadratic [112].

Therefore, more recent works tend to follow one of two key new directions to boost the performance of sequence alignment and edit distance implementations: (1) accelerating the dynamic programming algorithms using hardware accelerators, and (2) developing filtering heuristics that reduce the need for the dynamic programming algorithms, given an edit distance threshold. Hardware accelerators are becoming increasingly popular for speeding up the computationally expensive alignment and edit distance algorithms [113, 114, 115, 116]. Hardware accelerators include multi-core and SIMD (single instruction multiple data) capable central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs).
The classical dynamic programming algorithms are typically accelerated by computing only the necessary regions (i.e., diagonal vectors) of the dynamic programming matrix rather than the entire matrix, as proposed in Ukkonen's banded algorithm [101]. The number of diagonal bands required for computing the dynamic programming matrix is 2E+1, where E is a user-defined edit distance threshold. The banded algorithm is still beneficial even in its recent sequential implementations, as in Edlib [78]. The Edlib algorithm is implemented in C for standard CPUs and calculates the banded Levenshtein distance. Parasail [54] exploits both Ukkonen's banded algorithm and SIMD-capable CPUs to compute a banded alignment for a sequence pair with a user-defined scoring matrix and affine gap penalty. SIMD instructions offer significant parallelism to the matrix computation by executing the same vector operation on multiple operands at once.

The multi-core architecture of CPUs and GPUs provides the ability to compute the alignments of many sequence pairs independently and concurrently [56, 57]. GSWABE [57] exploits GPUs (Tesla K40) for a highly-parallel computation of global alignment with affine gap penalty. CUDASW++ 3.0 [59] exploits the SIMD capability of both CPUs and GPUs (GTX690) to accelerate the computation of the Smith-Waterman algorithm with affine gap penalty. CUDASW++ 3.0 provides only the optimal score, not the optimal alignment (i.e., no backtracking step).

Other designs, for instance FPGASW [53], exploit the very large number of hardware execution units in FPGAs (Xilinx VC707) to form a linear systolic array [117]. Each execution unit in the systolic array is responsible for computing the value of a single entry of the dynamic programming matrix. The systolic array computes a single vector of the matrix at a time. The data dependencies between the entries restrict the systolic array to computing the vectors sequentially (e.g., top-to-bottom, left-to-right, or in an anti-diagonal manner).
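The banded idea can be illustrated with the following Python sketch, which computes the Levenshtein distance over only the 2E+1 diagonals |i - j| <= E. It is a simplified sequential sketch in the spirit of Ukkonen's banded algorithm, not the Edlib or Parasail implementation.

```python
def banded_edit_distance(s, t, E):
    """Banded Levenshtein distance: compute only cells with |i - j| <= E.
    Returns the distance if it is at most E, otherwise None (filtered)."""
    m, n = len(s), len(t)
    if abs(m - n) > E:
        return None            # length difference alone exceeds the band
    INF = E + 1                # any value > E behaves as "out of band"
    prev = [min(j, INF) for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [INF] * (n + 1)
        if i <= E:
            curr[0] = i
        for j in range(max(1, i - E), min(n, i + E) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match/substitution
        prev = curr
    return prev[n] if prev[n] <= E else None
```

Only O(mE) cells are filled instead of O(mn), which is why the band remains beneficial even in purely sequential implementations.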
An FPGA acceleration platform can also provide further speedup to big-data computing frameworks, such as Apache Spark, for accelerating BWA-MEM [21]. With this integration, Chen et al. [118] achieve a 2.6x speedup over the same cloud-based implementation without FPGA acceleration [119]. FPGA accelerators seem to yield the highest performance gain compared to the other hardware accelerators [52, 53, 68, 58]. However, many of these efforts either simplify the scoring function, or only take into account accelerating the computation of the dynamic programming matrix without providing the optimal alignment, as in [66, 67, 59]. Different scoring functions are typically needed to better quantify the similarity between two sequences [120, 121]. The backtracking step required for the optimal alignment computation involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation. Comprehensive surveys on hardware acceleration for computational genomics appear in [113, 114, 115, 116, 33].

The third approach to accelerate read mapping is to incorporate a pre-alignment filtering technique within the read mapper, before the read alignment step. This filter is responsible for quickly excluding incorrect mappings at an early stage (i.e., as a pre-alignment step) to reduce the number of false mappings (i.e., mappings that have more edits than the user-defined threshold) that must be verified via dynamic programming. Existing filtering techniques include the so-called shifted Hamming distance (SHD) [65], which we explain next.

2.2.3.1 Shifted Hamming Distance (SHD)
SHD enables pre-alignment filtering in the presence of indels and substitutions. Instead of building a single bit-vector using a pairwise comparison as Hamming distance does, SHD builds 2E+1 bit-vectors, where E is the user-defined edit distance threshold. This is similar to Ukkonen's banded algorithm [101]. Each bit-vector is built by gradually shifting the read sequence and then performing a pairwise comparison. The shifting process is necessary in order to skip the deleted (or inserted) character and examine the subsequent matches. SHD merges all masks using a bitwise AND operation. Due to the use of the AND operation, a zero (i.e., pairwise match) at any position in the 2E+1 masks leads to a '0' in the resulting output of the AND operation at the same position. The last step is to count the positions that have a value other than '0'. SHD decides whether the mapping is correct based on whether the number of mismatches exceeds the edit distance threshold. SHD heavily relies on bitwise operations such as shift, XOR, and AND. This makes SHD suitable for bitwise hardware implementations (e.g., FPGAs and SIMD-enabled CPUs).

Our crucial observation is that SHD examines each mapping, throughout the filtering process, by performing expensive computations unnecessarily, as SHD uses the same amount of computation regardless of the type of edit. SHD is also implemented using Intel SSE, which limits the supported read length to only 128 bp (due to the SIMD register size). The filtering mechanism of SHD also introduces inaccuracy in its filtering decision, as we investigate and demonstrate in Chapter 3 and in our experimental evaluation, Chapter 9.

We survey in this chapter the existing key directions that aim at accelerating all or part of the operations of modern read mappers.
We analyze these attempts and provide the pros and cons of each direction. We present three main acceleration approaches: (1) seed filtering, (2) accelerating the dynamic programming algorithm, and (3) pre-alignment filtering. In Figure 1.1, we illustrate that the state-of-the-art mapper mrFAST with FastHASH [19] generates more than 90% of the potential mappings as incorrect ones, although it implements a seed filtering mechanism (Adjacency Filter) and a SIMD-accelerated banded Levenshtein edit distance algorithm. This demonstrates that the development of a fundamentally new, fast, and efficient pre-alignment filter is of utmost necessity. Note that there is still no work, to the best of our knowledge, on specialized hardware acceleration of pre-alignment filtering techniques.

Chapter 3
Understanding and Improving Pre-alignment Filtering Accuracy
In this chapter, we first provide the performance metrics used to evaluate pre-alignment filtering techniques. The essential performance metrics are filtering speed and filtering accuracy. We then study the causes of filtering inaccuracy of the state-of-the-art pre-alignment filter, SHD [65], aiming at eliminating them. We find four key causes and provide a detailed investigation along with examples of these inaccuracy sources. This is the first work to comprehensively assess the filtering inaccuracy of the SHD algorithm [65] and provide recommendations for desirable improvements.
An ideal pre-alignment filter should be both fast and accurate in rejecting incorrect mappings. Meanwhile, it should also preserve all correct mappings. An incorrect mapping is defined as a sequence pair that differs by more than the edit distance threshold. A correct mapping is defined as a sequence pair that has a number of edits less than or equal to the edit distance threshold.

Next, we describe the performance metrics that are necessary to evaluate the speed and accuracy of existing and future pre-alignment filtering algorithms.
The filtering speed is defined as the time spent by the pre-alignment filter in examining all the incoming mappings. We always want to increase the speed of the pre-alignment filter to compensate for the computational overhead introduced by its filtering technique.
The false accept rate (or false positive rate) is the ratio between the incorrect mappings that are falsely accepted by the filter and the incorrect mappings that are rejected by the optimal read alignment algorithm. That is, a mapping is considered a false positive if the pre-alignment filter accepts it but read alignment rejects it. We always want to minimize the false accept rate.
The true accept rate (or true positive rate) is the ratio between the correct mappings that are accepted by the filter and the correct mappings that are accepted by the optimal read alignment algorithm. The true accept rate should always equal 1.

3.1.2.3 False Reject Rate
The false reject rate (or false negative rate) is the ratio between the correct mappings that are rejected by the filter and the correct mappings that are accepted by the optimal read alignment algorithm. The false reject rate should always equal 0.
The true reject rate (or true negative rate) is the ratio between the incorrect mappings that are rejected by the filter and the incorrect mappings that are rejected by the optimal read alignment algorithm. We always want to maximize the true reject rate. In fact, the true reject rate is the complement of the false accept rate: the two rates sum to 1 over the incorrect mappings. (If all mappings are correct and accepted by the filter, both rates are undefined, as there are no incorrect mappings.)
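Given per-mapping accept/reject decisions from a filter and from the optimal aligner (the ground truth), the four rates defined above can be computed as follows (an illustrative sketch; the function name and interface are ours):

```python
def filtering_rates(filter_accepts, aligner_accepts):
    """Compute the four rates from per-mapping decisions of a pre-alignment
    filter and an optimal aligner (which serves as the ground truth)."""
    tp = fp = tn = fn = 0
    for f, a in zip(filter_accepts, aligner_accepts):
        if f and a:
            tp += 1   # correct mapping kept by the filter
        elif f and not a:
            fp += 1   # incorrect mapping falsely accepted
        elif not f and not a:
            tn += 1   # incorrect mapping correctly rejected
        else:
            fn += 1   # correct mapping falsely rejected
    pos = tp + fn     # mappings accepted by the aligner (correct mappings)
    neg = fp + tn     # mappings rejected by the aligner (incorrect mappings)
    return {
        "true_accept_rate":  tp / pos if pos else 0.0,  # should equal 1
        "false_reject_rate": fn / pos if pos else 0.0,  # should equal 0
        "false_accept_rate": fp / neg if neg else 0.0,  # to be minimized
        "true_reject_rate":  tn / neg if neg else 0.0,  # to be maximized
    }
```

Note that the true accept and false reject rates share the aligner-accepted mappings as their denominator, while the false accept and true reject rates share the aligner-rejected mappings.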
Can a very fast filter with a high false accept rate be better than a more accurate but slower filter? The answer to this question is not trivial because both speed and accuracy contribute to the overall speed of read alignment. The only way to answer this question is to evaluate the effect of such a filter on the overall speed of the read alignment step. Thus, we need to evaluate the end-to-end alignment speed. This includes integrating the pre-alignment filter with the read alignment step and evaluating the acceleration rate. Another crucial observation is that a pre-alignment filter applies a heuristic approach, which can be optimal for some alignment cases while it fails in other cases. Thus, a filter that performs best for a specific read set and edit distance threshold may not perform well for other read sets and edit distance thresholds. The user-defined edit distance threshold, E, is usually less than 5% of the read length [18, 65, 122, 62].

3.2 On the False Filtering of the SHD Algorithm

In this section, we investigate the potential causes of filtering inaccuracy that are introduced by the state-of-the-art filter, SHD [65] (we describe the algorithm in Chapter 2). We also provide examples that illustrate each of these causes. Adding an additional fast filtering heuristic before the verification step in a read mapper can be beneficial. But such a filter can easily be worthless if it allows a high false accept rate. Even though the incorrect mappings that pass SHD are discarded later by the read alignment step (as it has a zero false accept rate and a zero false reject rate), they can dramatically increase the execution time of the read mapper by causing a mapping to be examined twice unnecessarily, by both the filtering step and the read alignment step. Below, we describe four major sources of false positives that are introduced by the filtering strategy of SHD.
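To make the discussion of these sources concrete, here is a simplified sketch of SHD's mask-merging step (in Python, ignoring the SRS amendment and the conservative streak counting described later; the real implementation operates on packed bit-vectors with SSE instructions):

```python
def shd_filter(read: str, ref: str, E: int) -> bool:
    """Simplified sketch of SHD (no SRS, no streak counting): AND the
    2E+1 shifted Hamming masks and threshold the surviving mismatches.
    The read and reference are assumed to be equal-length windows."""
    n = len(read)
    final = [1] * n                      # AND identity: mismatch everywhere
    for shift in range(-E, E + 1):       # the 2E+1 shifted Hamming masks
        mask = []
        for i in range(n):
            j = i + shift
            if 0 <= j < n:
                mask.append(0 if read[j] == ref[i] else 1)  # XOR-style compare
            else:
                mask.append(0)           # vacant bits are zero-filled (match)
        final = [f & m for f, m in zip(final, mask)]
    # accept iff the number of surviving mismatches is within the threshold
    return sum(final) <= E
```

The zero-filled out-of-range positions in this sketch correspond to the vacant bits of the bitwise implementation, which we return to in the discussion of leading and trailing zeros.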
The first source of the false accept rate of SHD [65] is the random zeros that appear in the individual shifted Hamming masks. Although they result from a pairwise comparison between a shifted read and a reference segment, we refer to them as random zeros because they are sometimes meaningless and are not part of the correct alignment. SHD ANDs all shifted Hamming masks together with the idea that all '0's in the individual Hamming masks propagate to the final bit-vector, thereby preserving the information of individual matching subsequences. Due to the use of the AND operation, a zero at any position in the 2E+1 Hamming masks leads to a '0' in the resulting final bit-vector at the same position. Hence, even if some Hamming masks show a mismatch at that position, a zero in some other mask leads to a match ('0') at the same position. This tends to underestimate the actual number of edits and eventually causes some incorrect mappings to pass. To fix this issue, SHD proposes the so-called speculative removal of short matches (SRS) before ANDing the masks, which flips short streaks of '0's in each mask into '1's such that they do not mask out '1's in other Hamming masks.
Figure 3.1: An example of a mapping with all its generated masks, where the edit distance threshold (E) is set to 3. The green highlighted subsequences are part of the correct alignment. The red highlighted bit in the final bit-vector is a wrong alignment provided by SHD. The correct alignment (highlighted in yellow) shows that there are three substitutions and a single deletion, while SHD detects only two substitutions and a single deletion.

We illustrate this method in Figure 3.1. The number of zeros to be amended (the SRS threshold) is set by default to two. That is, bit streams such as 101 and 1001 are replaced with 111 and 1111, respectively. The 2E+1 masks contain other zeros that are part of the correct alignment. For example, Figure 3.1 shows a segment of consecutive matches in the one-step right-shifted mask. This segment indicates that a single deletion occurred in the read sequence. Unlike these meaningful zeros, random amended zeros can be anywhere in the masks except at the two ends of each mask. However, the length and the position of these zeros are unpredictable. They can have any length, which makes the SRS method ineffective at handling these random zeros. There is no clear theory behind the exact SRS threshold to be used to eliminate such zeros. SRS successfully reduces some of the falsely accepted mappings, but it also introduces its own source of falsely accepted mappings. Choosing a small SRS threshold helps, but does not provide any guarantee, to get rid of some of these random zeros. Choosing a larger SRS threshold can be risky, since, with such a large threshold, SHD might no longer be able to distinguish whether a streak of consecutive zeros is generated by random chance or is part of the correct alignment.
This results in SHD ignoring most of the exact matching subsequences and causes an all-'1' final bit-vector.
Figure 3.2: An example of an incorrect mapping that passes the SHD filter due to the random zeros. While the edit distance threshold is 3, a mapping of 4 edits (as verified at the end of the figure by the Needleman-Wunsch algorithm) passes as a falsely accepted mapping.

In Figure 3.2, we provide an example where random zeros dominate and lead to a zero in the final bit-vector at their corresponding locations. SRS can address the inaccuracy caused by the random 3-bit zeros, which are highlighted by the left arrow, using an SRS threshold of 3. However, SRS is still unable to solve the inaccuracy caused by the 15-bit zeros that are highlighted by the right arrow. This is due to the fact that the 15-bit zeros are part of the correct alignment, and hence amending them to ones can introduce more falsely accepted mappings.
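The SRS amendment itself is straightforward to sketch: streaks of at most t '0's that are flanked by '1's on both sides are flipped to '1's, while leading and trailing zeros are left untouched. This is an illustrative Python sketch of the rule, not the SSE implementation:

```python
import re

def srs_amend(mask: str, t: int = 2) -> str:
    """Speculative removal of short matches (SRS): flip streaks of up to t
    '0's that are flanked by '1's so that they cannot mask out mismatches
    in other Hamming masks when the masks are ANDed together."""
    # e.g. with t = 2: '101' -> '111', '1001' -> '1111', '10001' unchanged
    return re.sub(r'(?<=1)0{1,%d}(?=1)' % t,
                  lambda m: '1' * len(m.group()), mask)
```

As the examples above show, the rule is a heuristic: it cannot touch long streaks of random zeros, and flipping genuine short matches introduces its own source of errors.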
The second source of the high false accept rate of SHD [65] is related to the way in which SHD counts the edits in the final bit-vector. Amending short streaks of '0's to '1's could cause correct mappings to be mistakenly filtered out, as it may produce multiple ones in the final bit-vector. To ensure that it does not overcount the number of edits, SHD always assumes that a streak of '1's in the final bit-vector is a side effect of the SRS amendment, and counts only the minimum number of edits that could potentially generate such a streak of '1's. The total number of edits reported by SHD can thus be much smaller than the actual number of edits.
Figure 3.3: An example of an incorrect mapping that passes the SHD filter due to conservative counting of the short streaks of '1's in the final bit-vector.

For instance, as illustrated in Figure 3.3, three consecutive substitutions render a streak of three '1's in the final bit-vector. But since SHD always assumes the middle '1' is the result of a '0' amended by SRS, SHD will only consider the streak of three '1's as a single edit and let it pass, even if the edit distance threshold is less than three.
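The conservative counting rule can be sketched as follows: a streak of k consecutive '1's in the final bit-vector is charged only ceil(k / (t+1)) edits, where t is the SRS threshold, since up to t amended zeros may separate any two real edits. This is our illustrative reconstruction of the rule, not the exact SHD code:

```python
import math

def conservative_edit_count(final_bitvector: str, srs_threshold: int = 2) -> int:
    """Sketch of SHD's conservative counting: each streak of k consecutive
    '1's is attributed to the minimum number of edits that could produce
    it after SRS amendment, i.e. ceil(k / (srs_threshold + 1))."""
    edits = 0
    for streak in final_bitvector.split('0'):
        if streak:  # a maximal run of '1's
            edits += math.ceil(len(streak) / (srs_threshold + 1))
    return edits
```

With the default t = 2, the streak '111' produced by three real substitutions in Figure 3.3 is charged only ceil(3/3) = 1 edit, which is exactly the undercounting described above.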
The third source of the high false accept rate of SHD [65] is the streaks of zeros that are located at either of the two ends of each mask. Hence we refer to them as leading and trailing zeros. These streaks of zeros can appear in two forms: (1) the vacant bits that are caused by shifting the read against the reference segment, and (2) the streaks of zeros that are not vacant bits. SHD generates the 2E+1 masks using arithmetic left-shift and arithmetic right-shift operations. For the left and right directions, the right-most and the left-most vacant bits, respectively, are filled with '0's. The number of vacant zeros depends on the number of shifted steps for each mask, which is at most equal to the edit distance threshold. The second form of the leading and trailing zeros is the zeros that are located at the two ends of the Hamming masks and are not vacant zeros. These streaks of zeros result from the pairwise comparison (i.e., bitwise XOR). They differ from the vacant bits in that their length is independent of the edit distance threshold.
Figure 3.4: An example of an invalid mapping that passes the SHD filter due to the leading and trailing zeros. We use an edit distance threshold of 3 and an SRS threshold of 2. While the regions that are highlighted in green are part of the correct alignment, the wrong alignment provided by SHD is highlighted in red. The yellow highlighted bits indicate a source of falsely accepted mapping.

The main issue with both forms of leading and trailing zeros is that they always dominate, even if some Hamming masks show a mismatch at that position (due to the use of the AND operation). This gives the false impression that the read and the reference have a smaller edit distance, even when they differ significantly, as explained in Figure 3.4. SRS cannot address the inaccuracy caused by the leading and trailing zeros (by amending such zeros to ones) for two reasons: (1) the number of these consecutive zeros is not fixed, and thus they can be longer than the SRS threshold; (2) these consecutive zeros are not surrounded by ones, and hence even if the SRS threshold is greater than two bits, they are not eligible to be amended.
The last source of the false accept rate in SHD [65] is the inability of SHD to backtrack (after generating the final bit-vector) the location of each long identical subsequence (i.e., the mask that originates the identical subsequence), which is part of the correct alignment. The source of each subsequence provides a key insight into the actual number of edits between every two subsequences. That is, if a subsequence is located in the 2-step right-shifted mask, this indicates that there are two deletions before this subsequence.
Figure 3.5: Examples of incorrect mappings that pass the SHD filter due to (a) overlapping identical subsequences, and (b) lack of backtracking.

SHD does not relate this important fact to the number of edits in the final bit-vector. The lack of backtracking causes two types of falsely accepted mappings: (1) the first type appears clearly when two of the identical subsequences in the individual Hamming masks overlap or nearly overlap; (2) the second type happens when the identical subsequences come from different Hamming masks. The issue with the first type (i.e., overlapping subsequences) is that they appear as a single identical subsequence in the final bit-vector, due to the use of the AND operation. An example of this scenario is given in Figure 3.5(a). This tends to hide some of the edits and eventually causes some invalid mappings to pass. The second type of false positives caused by the lack of backtracking happens, for example, when one identical subsequence comes from the first Hamming mask (i.e., with no shift) and the next identical subsequence comes from the 3-step left-shifted mask. This scenario reveals that the number of edits between the two subsequences should be no less than three insertions. However, SHD inaccurately reports it as a single edit (due to ANDing all Hamming masks without backtracking the source of each streak of zeros), as illustrated in Figure 3.5(b). Keeping track of the source mask of each identical subsequence prevents such false positives and helps to reveal the correct number of edits.
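The structure of these identical subsequences can be illustrated with a small sketch. For a substitution-only pair with E edits, the matching positions split into at most E+1 runs whose lengths sum to m − E (a hypothetical helper of ours, covering substitutions only; handling indels would require walking the shifted masks):

```python
def identical_subsequences(read: str, ref: str):
    """Illustrative check of the long-match property for a substitution-only
    pair: E edits split the read into at most E+1 identical subsequences
    whose lengths sum to m - E, where m is the read length."""
    runs, cur = [], 0
    for a, b in zip(read, ref):
        if a == b:
            cur += 1          # extend the current run of matches
        else:
            if cur:
                runs.append(cur)
            cur = 0           # an edit terminates the run
    if cur:
        runs.append(cur)
    return runs

# Example: two substitutions (E = 2) in a 10 bp read yield
# at most E + 1 = 3 runs with total length m - E = 8.
runs = identical_subsequences("ACGTACGTAC", "ACGAACGTTC")
```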
In this section, we provide our own observations and recommendations based on our comprehensive accuracy analysis of the SHD filter [65]. We make two crucial observations. (1) The first observation is that handling the short streaks of '0's (i.e., using the SRS method that we discuss above) is inherently inefficient. These "noisy" streaks do not have well-determined properties, as their length and number are unpredictable (random-like). They introduce their own sources of falsely accepted mappings and do not contribute any useful information. Therefore, future filtering strategies should avoid processing such short streaks of '0's. (2) The second observation is that the correct (desired) alignment always contains all the longest non-overlapping identical subsequences. This turns our attention to the long matches (highlighted in green in all previous figures, i.e., Figure 3.1 to Figure 3.5) in each Hamming mask. We find that the long non-overlapping subsequences of consecutive zeros have two interesting properties. (1) There is an upper bound on their quantity: with E edits, there are at most E+1 non-overlapping identical subsequences shared between a pair of sequences, and the total length of these non-overlapping subsequences is equal to m-E, where m is the read length. (2) The source mask of each long subsequence provides an insight into the number of edits between this subsequence and its preceding one. These two observations motivate us to incorporate long-match awareness into the design of our filtering strategy and to avoid processing noisy short matches.

3.4 Summary

We identify four causes of the filtering inaccuracy of the SHD [65] algorithm, namely: random zeros, conservative counting, leading and trailing zeros, and lack of backtracking.
Based on these four sources of falsely accepted mappings, we observe that there are still opportunities for further improvements in the accuracy of the state-of-the-art filter, SHD, which we discuss next.

Chapter 4
The First Hardware Accelerator for Pre-Alignment Filtering
In this chapter, we introduce a new FPGA-based accelerator architecture for hardware-aware pre-alignment filtering algorithms. To our knowledge, this is the first work that exploits reconfigurable hardware platforms to accelerate pre-alignment filtering. A fast filter designed on a specialized hardware platform can drastically expedite alignment by reducing the number of locations that must be verified via dynamic programming. This eliminates many unnecessary expensive computations, thereby greatly improving overall run time.
We select the FPGA as an acceleration platform for our proposed pre-alignment filtering algorithms, as its architecture offers large amounts of parallelism [114, 123, 124]. The use of an FPGA as an acceleration platform can yield significant performance improvements, especially for massively parallel algorithms.

FPGAs are the most commonly used form of reconfigurable hardware engines in bioinformatics today [125, 126, 115], and their computational capabilities are greatly increasing every generation due to the increasing number of transistors on the FPGA chip. An FPGA chip can be programmed (i.e., configured) to include a very large number of hardware execution units that are custom-tailored to the problem at hand.
One of our aims is to accelerate our new pre-alignment filtering algorithms (which we describe in the next three chapters) by leveraging the capabilities and parallelism of FPGAs. To this end, we build our own hardware accelerator that consists of an FPGA engine as an essential component and a CPU. We present in Figure 4.1 the overall architecture of our FPGA-based accelerator. The CPU is responsible for acquiring and encoding the short reads and transferring the data to and from the FPGA. The FPGA engine is equipped with PCIe transceivers, a Read Controller, a Result Controller, and a set of filtering units that are responsible for examining the read alignments. The workflow of the accelerator starts with reading a repository of short reads and seed locations. All reads are then converted into a binary representation that can be understood by the FPGA engine. Encoding the reads is a preprocessing step and is accomplished through a Read Encoder at the host before transmitting the reads to the FPGA chip. Next, the encoded reads are transmitted and processed in a streaming fashion through the fastest communication medium available on the FPGA board (i.e., PCIe). We design our system to perform alignment filtering in a streaming fashion: the accelerator receives a continual stream of short reads, examines each alignment in parallel with the others, and returns the decision (i.e., a single bit of value '1' for an accepted sequence and '0' for a rejected sequence) back to the CPU immediately upon processing.
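The text does not fix a particular binary encoding here; as an illustrative sketch, a host-side Read Encoder could pack each base into 2 bits, so that a 100 bp read fits in 200 bits before the PCIe transfer (the mapping A/C/G/T → 00/01/10/11 and the function names are our assumptions):

```python
ENCODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}  # assumed 2-bit code

def encode_read(read: str) -> int:
    """Hypothetical sketch of the host-side Read Encoder: pack each base
    into 2 bits, first base in the most significant position."""
    word = 0
    for base in read:
        word = (word << 2) | ENCODE[base]
    return word

def decode_read(word: int, length: int) -> str:
    """Inverse of encode_read, for checking the round trip."""
    bases = 'ACGT'
    return ''.join(bases[(word >> (2 * (length - 1 - i))) & 0b11]
                   for i in range(length))
```

A fixed-width packed representation like this is what lets the FPGA engine consume reads as a raw bit stream over PCIe without any host-side parsing on the device.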
Figure 4.1: Overview of our accelerator architecture.
The Read Controller on the FPGA side is responsible for two main tasks. First, it permanently assigns the first data chunk as a reference sequence for all processing cores. Second, it manages the subsequent data chunks and distributes them to the processing cores: the first processing core receives the first read sequence, the second core receives the second sequence, and so on, up to the last core. The Read Controller repeats this data chunk distribution until no more reads are left in the repository.
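The distribution policy above can be modeled in a few lines. This is a minimal Python sketch of the ordering only; the unit count and chunk names are hypothetical, and the FIFO-based hardware implementation is not modeled:

```python
from itertools import cycle

def distribute(chunks, num_units):
    """Model of the Read Controller's distribution policy: the first
    chunk is pinned as the reference for every filtering unit; the
    remaining chunks are handed out round-robin, preserving order."""
    reference, reads = chunks[0], chunks[1:]
    units = {u: {"reference": reference, "reads": []}
             for u in range(num_units)}
    for unit, read in zip(cycle(range(num_units)), reads):
        units[unit]["reads"].append(read)
    return units

# Hypothetical chunk names, two filtering units.
units = distribute(["REF", "r1", "r2", "r3", "r4", "r5"], num_units=2)
```

Because the assignment is deterministic, the Result Controller can gather the per-unit decisions back in the original repository order.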
Following similar principles as the Read Controller, the Result Controller gathers the output results of the filtering units. Both the Read Controller and the Result Controller preserve the original order of reads as in the repository (i.e., at the host). This is critical to ensure that each read receives its own alignment filtering result. The results are transmitted back to the CPU side in a streaming fashion and then saved in the repository.

Figure 4.2: Xilinx Virtex-7 FPGA VC709 Connectivity Kit and chip layout of our implemented accelerator.
We design our hardware accelerator to exploit the large amounts of parallelism offered by FPGA architectures [114, 123, 124, 127, 128, 129, 130, 131, 132]. We take advantage of the fact that alignment filtering of one read is inherently independent of the filtering of another read. We can therefore examine many reads in a parallel fashion. In particular, instead of handling each read sequentially, as CPU-based filters (e.g., SHD) do, we can process a large number of reads at the same time by integrating as many hardware filtering units as possible (constrained by chip area) in the FPGA chip. Each filtering unit is a complete alignment filter and can handle a single read at a time. Our hardware accelerator contains a large number of filtering units, whose count can be configured by the user. Each filtering unit performs pre-alignment filtering independently of all other units. We use the term "filtering unit" in this work to refer to the entire operation of the filtering process involved. Filtering units are part of our architecture and are unrelated to the terms "CPU core" or "thread".

4.4 Hardware Implementation
Our hardware implementation of our accelerator is independent of any specific FPGA platform, as it does not rely on any vendor-specific computing elements (e.g., intellectual property cores). However, each FPGA board has different features and hardware capabilities that can directly or indirectly affect the performance and the data throughput of the design. In fact, the number of filtering units is determined by the maximum data throughput and the available FPGA resources. We use a Xilinx Virtex-7 VC709 board [133] to implement our accelerator architecture. We build the FPGA design with Vivado 2015.4 in synthesizable Verilog. We present the chip layout of our hardware accelerator in Figure 4.2. The maximum operating frequency of our accelerator and the VC709 board is 250 MHz. At this frequency, we observe a data throughput of nearly 3.3 GB/s, which corresponds to 13.3 billion bases per second. This nearly reaches the peak throughput of 3.64 GB/s provided by the RIFFA [134] communication channel that feeds data into the FPGA using a Gen3 4-lane PCIe link.
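The relationship between link throughput and base throughput follows directly from the 2-bit-per-base encoding described in Chapter 5. A quick sanity check, using the rounded numbers quoted above:

```python
# Each DNA base is encoded in 2 bits, so a link moving B bytes/s
# carries (B * 8) / 2 bases per second.
def bases_per_second(bytes_per_second):
    bits_per_base = 2
    return bytes_per_second * 8 / bits_per_base

# ~3.3 GB/s over PCIe corresponds to ~13.2 billion bases/s, in line
# with the ~13.3 Gbases/s figure above (the GB/s value is rounded).
rate = bases_per_second(3.3e9)
```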
We introduce in this chapter a new hardware accelerator architecture that exploits the large amounts of parallelism offered by FPGA architectures to boost the performance of our pre-alignment filters. Our hardware accelerator performs pre-alignment filtering for each sequence pair independently of all other pairs. We can therefore examine many reads in a parallel fashion. We build the hardware architecture of our accelerator using many hardware filtering units, where each filtering unit is a complete pre-alignment filter and can handle a single read at a time. To take full advantage of the capabilities and parallelism of our FPGA accelerator, each pre-alignment filtering unit needs to be designed and implemented using FPGA-supported operations such as bitwise operations, bit shifts, and bit count. Next, we discuss our proposed pre-alignment filters that can be included in our FPGA accelerator as filtering units.

Chapter 5
GateKeeper: Fast Hardware Pre-Alignment Filter
In this chapter, we introduce a new FPGA-based fast alignment filtering technique (called GateKeeper) that acts as a pre-alignment step in read mapping. Our filtering technique improves and accelerates the state-of-the-art SHD filtering algorithm [65] using new mechanisms and FPGAs.
Our new filtering algorithm has two properties that make it suitable for an FPGA-based implementation: (1) it is highly parallel, and (2) it heavily relies on bitwise operations such as shift, XOR, and AND. Our architecture discards the incorrect mappings from the candidate mapping pool in a streaming fashion: data is processed as it is transferred from the host system. Filtering the mappings in a streaming fashion gives us the ability to integrate our filter with any mapper that performs alignment, such as Bowtie 2 [22] and BWA-MEM [21]. Our current filter implementation relies on several optimization methods to create a robust and efficient filtering approach. At both the design and implementation stages, we satisfy several requirements: (1) ensuring a lossless filtering algorithm by preserving all correct mappings; (2) supporting both Hamming distance and edit distance; and (3) examining the alignment between a read and a reference segment in a fast and efficient way (in terms of execution time and required resources).
Our primary purpose is to enhance the state-of-the-art SHD alignment filter such that we can greatly accelerate pre-alignment by taking advantage of the capabilities and parallelism of FPGAs. To achieve our goal, we design an algorithm inspired by SHD to reduce both the utilized resources and the execution time. These optimizations enable us to integrate more filtering units within the FPGA chip and hence examine many mappings at the same time. We present three new methods that we use in each GateKeeper filtering unit to improve execution time. Our first method introduces a new algorithmic approach for performing alignment filtering very rapidly compared to the original SHD. This method provides: (1) fast detection of exactly matching alignments and (2) handling of one or more base substitutions. Our second method supports calculating the edit distance with a new, very efficient hardware design. Our third method addresses the problem of hardware resource overheads introduced by the use of the FPGA as an acceleration platform. All methods are implemented within the hardware filtering unit of our accelerator (see Chapter 4) and are thus performed highly efficiently. We present a flowchart representation of all steps involved in our algorithm in Figure 5.1. Next, we describe the three new methods.
We first discuss how to examine a mapping with a given Hamming distance threshold, and later extend our solution to support edit distance. Our first method aims to quickly detect the obviously-correct alignments that contain no edits or only a few substitutions (i.e., fewer than the user-defined threshold). If the first method detects a correct alignment, then we can skip the other two methods, but we still need the optimal alignment algorithms. A read is mappable if the Hamming distance between the read and its seed location does not exceed the given Hamming distance threshold. Hence, the first step is to identify all bp matches by calculating what we call a Hamming mask. The Hamming mask is a bit-vector of '0's and '1's representing the comparison of the read and the reference, where a '0' represents a bp match and a '1' represents a bp mismatch. We need to count only the occurrences of '1' in the Hamming mask and examine whether their total number is less than or equal to the user-defined Hamming distance threshold. If so, the mapping is considered valid and the read passes the filter. If, instead, the total number of '1's is greater than the Hamming distance threshold, then we cannot be certain whether this is because of a high number of substitutions, or because there exist insertions and/or deletions; hence, we need to follow the rest of our algorithm. Our filter can detect not only substitutions but also insertions and deletions in an efficient way, as we discuss next.
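This first check can be sketched in a few lines. The following is a Python model for clarity only; the hardware operates on the 2-bit encodings described later in this chapter, not on character strings:

```python
def hamming_mask(read, ref):
    """Per-base comparison: '0' for a match, '1' for a mismatch."""
    return "".join("0" if r == t else "1" for r, t in zip(read, ref))

def passes_substitution_check(read, ref, threshold):
    """Accept if the number of mismatches is within the threshold."""
    return hamming_mask(read, ref).count("1") <= threshold

mask = hamming_mask("ACGTACGT", "ACGAACGT")                     # "00010000"
ok = passes_substitution_check("ACGTACGT", "ACGAACGT", threshold=2)
```

A count above the threshold does not by itself reject the mapping; as the text explains, it only means the indel-handling steps below are still needed.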
Our indel detection algorithm is inspired by the original SHD algorithm presented in [65]. If the substitution detection rejects an alignment, then GateKeeper checks whether an insertion or deletion causes the violation (i.e., the high number of edits). Figure 5.2 illustrates the effect of the occurrence of edits on the alignment process. If there are one or more base substitutions, or the alignment is an exact match, the matching and mismatching regions can be accurately determined using Hamming distance. The Hamming mask also helps detect the matches that are located before the first indel, and this mask is already generated as part of the first method of the algorithm (i.e., Fast Approximate String Matching). On the other hand, each insertion and deletion can shift multiple trailing bases and create multiple edits in the Hamming mask. Thus, our indel detection method identifies whether the alignment locations of a read are valid by shifting individual bases.
Figure 5.1: A flowchart representation of the GateKeeper algorithm.
Figure 5.2: An example showing how various types of edits affect the alignment of two reads. In (a), the upper read exactly matches the lower read, and thus each base exactly matches the corresponding base in the target read. (b) shows base substitutions, which only affect the alignment at their own positions. (c) and (d) demonstrate insertions and deletions, respectively. Each such edit influences the alignment of all the subsequent bases.

We need to perform E incremental shifts in the right (or left) direction to detect any read that has E deletions (or insertions), where E is the edit distance threshold. The shift process is guaranteed to cancel the effect of an indel. As we do not have prior knowledge about whether there exist substitutions, indels, or a combination of both, we need to test for every possible case in our algorithm. Thus, GateKeeper generates 2E + 1 Hamming masks regardless of the source of the edits. The last step is to merge all the 2E + 1 Hamming masks using a bitwise AND operation. This step tells us where the relevant matching and mismatching regions reside in the presence of edits in the read compared to the reference segment. Similarly to SHD, we apply an amending process to the 2E + 1 masks before performing the AND operation. In SHD, the amending process is accomplished using a 4-bit packed shuffle (a SIMD parallel table-lookup instruction), shift, and OR operations. The number of computations needed is 4 packed shuffles, 4m bitwise ORs, and three shift operations for each Hamming mask, which is (7 + 4m)(2E + 1) operations in total, where m is the read length. We find that this is very inefficient for an FPGA implementation.
Figure 5.3: Workflow of the proposed architecture for the parallel amendment operations.

To reduce the number of operations, we build a new hardware-based amending process. We propose using dedicated hardware components in FPGA slices. More precisely, rather than shifting the read and then performing a packed shuffle to replace patterns of 101 or 1001 with 111 or 1111, respectively, we perform only the packed shuffle, independently and concurrently for each bit of each Hamming mask. We present the proposed architecture for the amendment operation in Figure 5.3. In order to replace all patterns of 101 or 1001 with 111 or 1111, respectively, we use a single 5-input look-up table (LUT) for each bit of the Hamming mask. The first LUT copies the bit value of the first input regardless of its value; even if it is zero, it will not be amended, as it does not contribute to a 101 or 1001 pattern. Likewise for the last LUT. Thus, the total number of LUTs needed is equal to the length of the short read in bases minus 2, for the first and last bases. In each LUT, we consider a single bit of the Hamming mask together with its two right neighboring bits and its two left neighboring bits. If the input that corresponds to the output has a bit value of one, then the output copies the value of that input bit (as we only amend zeros). Otherwise, using the previous two bits and the following two bits with respect to the input bit, we can replace any zero of the "101" or "1001" patterns independently of the other output bits. All bits of the amended masks are generated at the same time, as the propagation delay through an FPGA look-up table is independent of the implemented function [135]. Thus, we can process all masks in a parallel fashion without affecting the correctness of the filtering decision. Using this dedicated architecture, we are able to get rid of the four shifting operations and perform the amending process concurrently for all bits of any Hamming mask.
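The per-bit 5-input LUT can be modeled bit by bit in software. The window logic below is our reading of the 101/1001 replacement rule and is a sketch, not the actual LUT truth table; out-of-range neighbors are treated as zeros:

```python
def amend(mask):
    """Per-bit amendment: each output bit depends only on a 5-bit
    window of the input (a 5-input LUT in hardware), replacing the
    zeros inside '101' and '1001' patterns with ones."""
    bits = [int(b) for b in mask]
    n = len(bits)

    def get(i):                      # out-of-range bits read as 0
        return bits[i] if 0 <= i < n else 0

    out = []
    for i in range(n):
        if bits[i] == 1:             # ones are copied through
            out.append(1)
        else:
            in_101 = get(i - 1) == 1 and get(i + 1) == 1
            in_1001 = (get(i - 1) == 1 and get(i + 1) == 0 and get(i + 2) == 1) \
                   or (get(i - 2) == 1 and get(i - 1) == 0 and get(i + 1) == 1)
            out.append(1 if (in_101 or in_1001) else 0)
    return "".join(map(str, out))

amended = amend("0101001000")        # the 101 and 1001 runs become 111 and 1111
```

In hardware, every output bit is produced simultaneously by its own LUT, so the whole mask is amended in a single LUT delay.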
Thus, the required number of operations is only (2E + 1) instead of (7 + 4m)(2E + 1) for a total of (2E + 1) Hamming masks. This saves a considerable amount of the filtering time, reducing it by two orders of magnitude for a read that is 100 bp long. The short reads are composed of a string of nucleotides from the DNA alphabet {A, C, G, T}. Since the reads are processed on an FPGA platform, the symbols have to be encoded into a unique binary representation. We need 2 bits to encode each symbol. Hence, encoding a read sequence of length m results in a 2m-bit word. Encoding the reads into a binary representation introduces overhead to accommodate not only the encoded reads but also the Hamming masks, as their lengths also double (i.e., 2m bits). The issue introduced by encoding the reads can be even worse when we apply certain operations to these Hamming masks. For example, the number of LUTs required for performing the amending process on the Hamming masks doubles, mainly due to encoding the read. To reduce the complexity of the subsequent operations on the Hamming masks and save about half of the required amount of FPGA resources, we propose a new solution. We observe that comparing a pair of DNA nucleotides is similar to comparing their binary representations (e.g., comparing A to T is similar to comparing '00' to '11'). Hence, comparing each two bits of the binary representation of the read with their corresponding bits of the reference segment generates a single bit that represents one of two meanings: either a match or a mismatch between two bases.
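This mask-halving idea can be sketched as follows. Note that the exact 2-bit code is an assumption here: the text only fixes A as '00' and T as '11' by example, so the C/G assignments below are hypothetical:

```python
# A hypothetical 2-bit code; the text only fixes A = 00 and T = 11
# as an example, so the C/G assignments here are assumptions.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(seq):
    """Encode an m-base read into a 2m-bit word."""
    return "".join(CODE[base] for base in seq)

def halved_hamming_mask(read, ref):
    """XOR the 2m-bit words, then OR each pair of bits to shrink
    the mask back to m bits: 0 = base match, 1 = base mismatch."""
    a, b = encode(read), encode(ref)
    xor = [x != y for x, y in zip(a, b)]
    return "".join(str(int(xor[i] or xor[i + 1]))
                   for i in range(0, len(xor), 2))

mask = halved_hamming_mask("ACGT", "ACTT")   # third base differs
```

Each output bit depends only on two XOR bits, so in hardware the pairwise ORs all happen in parallel.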
Figure 5.4: An example of applying our solution for reducing the number of bits of each Hamming mask by half. We use a modified Hamming mask to store the result of applying the bitwise OR operation to each two bits of the Hamming mask. The modified mask maintains the same meaning as the original Hamming mask.

This is performed by encoding each two bits of the result of the pairwise comparison (i.e., bitwise XOR) into a single bit of '0' or '1' using OR operations in a parallel fashion, as explained in Figure 5.4. This makes the length of each Hamming mask equal to the length of the original read, without affecting the meaning of each bit of the mask. The modified Hamming masks are then merged together using 2E bitwise AND operations. Finally, we count the number of ones (i.e., edits) in the final bit-vector mask; if the count does not exceed the edit distance threshold, the filter accepts the mapping. In this section, we provide the pseudocode of our GateKeeper algorithm. Algorithm 5.1 presents the main functions. Algorithm 5.2 presents the details of the amending process used in Algorithm 5.1.

Algorithm 5.1:
GateKeeper filtering algorithm
Input: Candidate read bit-vector B = {b1, b2, ..., b2m}, reference bit-vector A = {a1, a2, ..., a2m}, edit distance threshold E.
Output: Pass (return True if the read passes the GateKeeper filter).
Functions: Amend: encodes and amends a Hamming mask (Algorithm 5.2).
Pseudocode:
1: // Calculate the Hamming distance first.
2: HammingMask[2E+1] ← A ⊕ B;
3: AmendedMask[2E+1] ← Amend(HammingMask[2E+1]);
4: // Generate 2E masks with incremental shifts.
5: for i ← 1 to E do
6:   HammingMask[i] ← (B >> i) ⊕ A;
7:   AmendedMask[i] ← Amend(HammingMask[i]);
8:   HammingMask[i+E] ← (B << i) ⊕ A;
9:   AmendedMask[i+E] ← Amend(HammingMask[i+E]);
10: end for
11: // Merge all amended masks and count the edits.
12: FinalMask ← AmendedMask[1] AND AmendedMask[2] AND ... AND AmendedMask[2E+1];
13: e ← CountOnes(FinalMask);
14: if e ≤ E then return True;
15: else return False;
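The shift-and-AND structure of Algorithm 5.1 can be transliterated into software as follows. This is a simplified Python sketch that operates on character strings rather than 2-bit-encoded words; the amendment step is omitted, which only makes the model more permissive (it remains lossless):

```python
def gatekeeper(read, ref, e):
    """Simplified model of GateKeeper's shift-and-AND filtering:
    build 2E + 1 Hamming masks (one unshifted plus E right-shifted
    and E left-shifted alignments of the read against the
    reference), AND them together, and accept if the surviving
    mismatch count is within the edit distance threshold E."""
    m = len(read)

    def mask(shift):
        out = []
        for j in range(m):
            i = j - shift
            if 0 <= i < m:
                out.append(0 if read[i] == ref[j] else 1)
            else:
                out.append(0)   # out-of-range bits count as matches
        return out

    masks = [mask(s) for s in range(-e, e + 1)]   # 2E + 1 masks
    final = [all(col) for col in zip(*masks)]     # bitwise AND
    return sum(final) <= e
```

A one-base shift between the two sequences (an indel-like case) is cancelled by one of the shifted masks, so the pair passes with E = 1 even though its unshifted Hamming distance is large.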
Algorithm 5.2: Amend
Input: Hamming mask bit-vector H = {h1, h2, ..., h2m}.
Output: Modified Hamming mask C = {c1, c2, ..., cm}.
Pseudocode:
1: i ← 1;
2: while i < 2m do
3:   C[(i+1)/2] ← H[i] OR H[i+1]; // encode each bit pair into one bit (Figure 5.4)
4:   i ← i + 2;
5: end while
6: for each bit C[k], 2 ≤ k ≤ m−1, in parallel do
7:   amend C[k] using its two left and two right neighbors, replacing the zeros of the '101' and '1001' patterns with ones (Figure 5.3);
8: end for
9: return C;

We introduce the first hardware acceleration system for alignment filtering, called GateKeeper. We develop both a hardware-acceleration-friendly filtering algorithm and a highly-parallel hardware accelerator design. GateKeeper is a standalone filter and can be integrated with any existing reference-based mapper. GateKeeper does not replace read alignment. GateKeeper should be followed by a read alignment step, which precisely verifies the mappings that pass our filter and eliminates the falsely accepted ones.

Chapter 6
Shouji: Fast and Accurate Hardware Pre-Alignment Filter

Our primary purpose is to reject incorrect mappings accurately and quickly such that we reduce the need for the computationally expensive alignment step. In this chapter, we propose the Shouji algorithm to achieve highly accurate filtering. We then accelerate Shouji by taking advantage of the capabilities and parallelism of FPGAs to achieve fast filtering operations. Shouji's filtering strategy is inspired by our analytical study of SHD's filtering accuracy (see Chapter 3). We discuss the details of the Shouji algorithm next.

The key filtering strategy of Shouji is inspired by the pigeonhole principle, which states that if E items are distributed into E+1 boxes, then at least one box remains empty. In the context of pre-alignment filtering, this principle provides the following key observation: if two sequences differ by E edits, then the two sequences share at least a single common subsequence (i.e., one that is free of edits) and at most E+1 non-overlapping common subsequences, where E is the edit distance threshold.

Figure 6.1: Random edit distribution in a read sequence. The edits (e1, e2, ..., eE) act as dividers, resulting in several identical subsequences (m1, m2, ..., mE+1) between the read and the reference.
With the existence of at most E edits, the total length of these non-overlapping common subsequences should not be less than m − E, where m is the sequence length, as illustrated in Figure 6.1. Shouji employs the pigeonhole principle to decide whether or not two sequences are potentially similar. Shouji finds all the non-overlapping subsequences that exist in both sequences. If the total length of these common subsequences is less than m − E, then there exist more edits than the allowed edit distance threshold, and hence Shouji rejects the two given sequences. Otherwise, Shouji accepts the two sequences. Next, we discuss the details of Shouji.

Shouji identifies the dissimilar sequences, without calculating the optimal alignment, in three main steps. (1) The first step is to construct what we call a neighborhood map that visualizes the pairwise matches and mismatches between two sequences given an edit distance threshold of E characters. (2) The second step is to find all the non-overlapping common subsequences in the neighborhood map using a sliding search window approach. (3) The last step is to accept or reject the given sequence pair based on the total length of the found matches. If the total length of the found matches is small, then Shouji rejects the input sequence pair.

6.2.1 Method 1: Building the Neighborhood Map

The neighborhood map, N, is a binary m-by-m matrix, where m is the read length. Given a text sequence T[1 ... m], a pattern sequence P[1 ... m], and an edit distance threshold E, the neighborhood map represents the comparison result of the i-th character of P with the j-th character of T, where i and j satisfy 1 ≤ i ≤ m and i − E ≤ j ≤ i + E. The entry N[i, j] of the neighborhood map is calculated as follows:

N[i, j] = 0 if P[i] = T[j], and N[i, j] = 1 if P[i] ≠ T[j].   (6.1)

We present in Figure 6.2 an example of a neighborhood map for two sequences, where sequence B differs from sequence A by three edits.
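Equation 6.1 and the banded structure of the map can be sketched as follows. This Python model leaves entries outside the band as None purely for illustration; the hardware never computes them at all:

```python
def neighborhood_map(text, pattern, e):
    """Banded binary m-by-m map: N[i][j] = 0 on a character match,
    1 on a mismatch, computed only for i - E <= j <= i + E
    (Equation 6.1, with 0-based indices)."""
    m = len(text)
    n = [[None] * m for _ in range(m)]
    for i in range(m):
        for j in range(max(0, i - e), min(m, i + e + 1)):
            n[i][j] = 0 if pattern[i] == text[j] else 1
    return n

# Figure 6.2's sequences (E = 3): the main diagonal starts with
# four matches, corresponding to the common subsequence GGTG.
N = neighborhood_map("GGTGCAGAGCTC", "GGTGAGAGTTGT", 3)
```

Because every entry is a single independent character comparison, all 2E + 1 diagonals can be produced in one step in hardware.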
The entry N[i, j] is set to zero if the i-th character of the pattern matches the j-th character of the text. Otherwise, it is set to one. The way we build our neighborhood map ensures that computing each of its entries is independent of every other, and thus the entire map can be computed all at once in a parallel fashion. Hence, our neighborhood map is well suited for highly-parallel computing platforms [71, 136]. Note that in sequence alignment algorithms, computing each entry of the dynamic programming matrix depends on the values of its immediate left, upper left, and upper entries. Different from the "dot plot" or "dot matrix" (a visual representation of the similarities between two closely similar genomic sequences) that is used in FASTA/FASTP [137], our neighborhood map computes only the necessary diagonals near the main diagonal of the matrix (e.g., for E = 3, Shouji computes only the 2E + 1 diagonal vectors of the neighborhood map).

The key goal of this step is to accurately find all the non-overlapping common subsequences shared between a pair of sequences. The accuracy of finding these subsequences is crucial for the overall filtering accuracy, as the filtering decision is made solely based on the total subsequence length.

Figure 6.2: Neighborhood map (N) and the Shouji bit-vector, for text T = GGTGCAGAGCTC and pattern P = GGTGAGAGTTGT, for E = 3. The three common subsequences (i.e., GGTG, AGAG, and T) are highlighted in yellow. We use a search window of size 4 columns (two examples of which are highlighted in red) with a step size of a single column. Shouji searches diagonally within each search window for the 4-bit vector that has the largest number of zeros. Once found, Shouji examines whether the found 4-bit vector maximizes the number of zeros at the corresponding location in the Shouji bit-vector. If so, Shouji stores this 4-bit vector in the Shouji bit-vector at its corresponding location.

With the existence of E edits, there are at most E+1 non-overlapping common subsequences (by the pigeonhole principle) shared between a pair of sequences. Each non-overlapping common subsequence is represented as a streak of diagonally-consecutive zeros in the neighborhood map (as highlighted in yellow in Figure 6.2). These streaks of diagonally-consecutive zeros are distributed along the diagonals of the neighborhood map without any prior information about their length or number. One way of finding these common subsequences is to use a brute-force approach, which examines all the streaks of diagonally-consecutive zeros that start at the first column and selects the streak that has the largest number of zeros as the first common subsequence. It then iterates over the remaining part of the neighborhood map to find the other common subsequences. However, this brute-force approach is infeasible for a highly-optimized hardware implementation, as the search space is unknown at design time. Shouji overcomes this issue by dividing the neighborhood map into equal-size parts. We call each part a search window. Limiting the size of the search space from the entire neighborhood map to a search window has three key benefits. (1) It helps to provide a scalable architecture that can be implemented for any sequence length and edit distance threshold. (2) Downsizing the search space to a reasonably small sub-matrix with a known dimension at design time limits the number of all possible permutations of each bit-vector to 2^n, where n is the search window size.
This reduces the size of the look-up tables (LUTs) required for an FPGA implementation and simplifies the overall design. (3) Each search window is considered a smaller sub-problem that can be solved independently and rapidly with high parallelism. Shouji uses a search window that is 4 columns wide, as we illustrate in Figure 6.2. We need m search windows for processing two sequences, each of which is of length m characters. Each search window overlaps with its next neighboring search window by 3 columns. This ensures covering the entire neighborhood map and finding all the common subsequences regardless of their starting locations. We select the width of each search window to be 4 columns to guarantee finding the shortest possible common subsequence, which is a single match located between two mismatches (i.e., '101'). However, we observe that the bit pattern '101' is not always necessarily a part of the correct alignment (or the common subsequences). For example, the bit pattern '101' exists once as a part of the correct alignment in Figure 6.2, but it also appears five times in other locations that are not included in the correct alignment. To improve the accuracy of finding the diagonally-consecutive matches, we increase the length of the diagonal vector to be examined to four bits. We also experimentally evaluate different window sizes in Figure 6.3. We find that a window size of 4 columns provides the highest filtering accuracy without falsely rejecting similar sequences. This is because individual matches (i.e., single zeros) are usually useless and not necessarily part of the common subsequences. As we increase the search window size, we ignore these individual matches and instead look only for longer streaks of consecutive zeros. Shouji finds the diagonally-consecutive matches that are part of the common subsequences in the neighborhood map in two main steps.
Step 1: For each search window, Shouji finds the 4-bit diagonal vector that has the largest number of zeros. Shouji greedily considers this vector as a part of the common subsequence, as it has the least possible number of edits (i.e., 1's). Always finding the maximum number of matches is necessary to avoid overestimating the actual number of edits and eventually to preserve all similar sequences. Shouji achieves this step by comparing the 4 bits of each of the 2E+1 diagonal vectors within a search window and selecting the 4-bit vector that has the largest number of zeros. In the case where two 4-bit subsequences have the same number of zeros, Shouji breaks the tie by selecting the first one that has a leading zero. Then, Shouji slides the search window by a single column (i.e., step size = 1 column) towards the last bottom-right entry of the neighborhood map and repeats the previous computations. Thus, Shouji performs "Step 1" m times using m search windows, where m is the sequence length.

Step 2: The last step is to gather the results found for each search window (i.e., the 4-bit vector that has the largest number of zeros) and reconstruct all the diagonally-consecutive matches. For this purpose, Shouji maintains a Shouji bit-vector of length m that stores all the zeros found in the neighborhood map, as we illustrate in Figure 6.2. For each sliding search window, Shouji examines whether the selected 4-bit vector maximizes the number of zeros in the Shouji bit-vector at the same corresponding location. If so, Shouji stores the selected 4-bit vector in the Shouji bit-vector at the same corresponding location. This is necessary to avoid overestimating the number of edits between two given sequences. The common subsequences are represented as streaks of consecutive zeros in the Shouji bit-vector.

Figure 6.3: The effect of the window size on the rate of falsely accepted sequences (i.e., dissimilar sequences that are considered similar by the Shouji filter). We observe that a window width of 4 columns provides the highest accuracy. We also observe that as the window size increases beyond 4 columns, more similar sequences are falsely rejected by Shouji, which should be avoided.

The last step of Shouji is to calculate the total number of edits (i.e., ones) in the Shouji bit-vector. Shouji examines whether the total number of ones in the Shouji bit-vector is greater than E. If so, Shouji excludes the two sequences from the optimal alignment calculation. Otherwise, Shouji considers the two sequences similar within the allowed edit distance threshold and allows their optimal alignment to be computed using optimal alignment algorithms. The Shouji bit-vector represents the differences between two sequences along the entire length of the sequence, m. However, Shouji is not limited to end-to-end edit distance calculation.
Figure 6.4: An example of applying the Shouji filtering algorithm to a sequence pair, where the edit distance threshold is set to 4. We present the content of the neighborhood map along with the Shouji bit-vector. We apply the Shouji algorithm starting from the leftmost column towards the rightmost column.

Shouji is also able to provide edit distance calculation in a local and glocal (semi-global) fashion.
For example, achieving local edit distance calculation requires ignoring the ones that are located at the two ends of the Shouji bit-vector. Achieving glocal edit distance calculation requires excluding the ones that are located at one of the two ends of the Shouji bit-vector from the total count of ones. This is important for correct pre-alignment filtering for global, local, and glocal alignment algorithms. We present an example of applying the Shouji filtering algorithm in Figure 6.4.

The Shouji filter does not filter out similar sequences; hence, it provides a zero false reject rate. The reason behind this is the way we find the identical subsequences: we always look for the subsequences that have the largest number of zeros, such that we maximize the number of matches and minimize the number of edits that cause the division of one long identical sequence into shorter subsequences. This also allows a very small portion of dissimilar sequences to pass. Next, we analyze the computational complexity (i.e., asymptotic run time and space complexity) of our Shouji filter.

The Shouji filter divides the problem of finding the identical subsequences into at most m subproblems, as described in Algorithm 6.1. Each subproblem examines each of the 2E+1 bit-vectors and finds the 4-bit subsequence that has the largest number of zeros within the sliding window. Once found, the Shouji filter also compares the found subsequence with its corresponding subsequence of the Shouji bit-vector. Now, let c be a constant representing the run time of examining each 4 bits of each bit-vector. Then the time complexity of the Shouji algorithm is as follows:

T_Shouji(m) = c · m · (2E + 1)    (6.2)

This demonstrates that the Shouji algorithm runs in linear time with respect to the sequence length and edit distance threshold. The Shouji algorithm maintains 2E+1 diagonal bit-vectors and an additional auxiliary bit-vector (i.e., the Shouji bit-vector) for each two given sequences.
The space complexity of the Shouji algorithm is as follows:

D_Shouji(m) = m · (2E + 2)    (6.3)

Hence, the Shouji algorithm requires linear space with respect to the read length and edit distance threshold. Next, we outline the hardware implementation details of the Shouji filter.

Algorithm 6.1: Shouji
Input: Seq1, Seq2, edit distance threshold (E).
Output: 1 (similar) or 0 (dissimilar).
Pseudocode:
1:  m ← length(Seq1);
2:  // Build Neighborhood map (N)
3:  for i ← 1 to m do
4:      for j ← i-E to i+E do
5:          if Seq1[i] = Seq2[j] then N[i,j] ← 0;
6:          else N[i,j] ← 1;
7:  // Sliding window search; function CZ() returns the number of zeros
8:  for i ← 1 to m do
9:      Shouji[i] ← 1;
10: for i ← 1 to m-3 do
11:     for j ← 1 to E do
12:         // Compare the upper diagonal with the lower diagonal
13:         if CZ(N[i+j:i+3+j, i:i+3]) > CZ(N[i:i+3, i+j:i+3+j]) then
14:             Z ← N[i+j:i+3+j, i:i+3];
15:         else if CZ(N[i+j:i+3+j, i:i+3]) = CZ(N[i:i+3, i+j:i+3+j]) then
16:             if N[i+j,i] = 0 then Z ← N[i+j:i+3+j, i:i+3];
17:             else if N[i,i+j] = 0 then Z ← N[i:i+3, i+j:i+3+j];
18:         else Z ← N[i:i+3, i+j:i+3+j];
19:         // Compare Z with the main diagonal and the Shouji bit-vector
20:         if CZ(N[i:i+3, i:i+3]) > CZ(Z) then Z ← N[i:i+3, i:i+3];
21:         if CZ(Z) > CZ(Shouji[i:i+3]) then Shouji[i:i+3] ← Z;
22: if CZ(Shouji) ≥ m-E then return 1;
23: else return 0;

To make the best use of the available resources in the FPGA chip, our algorithm utilizes the operations that are easily supported on an FPGA, such as bitwise operations, bit shifts, and bit count. To build the neighborhood map on the FPGA, we use the observation that the main diagonal can be implemented using a bitwise XOR operation between the two given sequences. The vacant bits due to the shift operation are filled with ones.

[Figure content: Step 1 builds the neighborhood map (the diagonal bit-vectors) from the pattern, the text, and the edit distance threshold; Step 2 feeds m search windows, each selecting the 4-bit vector with the highest number of zeros; Step 3 checks whether the zero count is ≥ m-E (1: similar, 0: dissimilar).]

Figure 6.5: A block diagram of the sliding window scheme implemented in FPGA for a single filtering unit.

The upper E diagonals can be implemented by gradually shifting the pattern (P) in the right-hand direction and then performing a bitwise XOR with the text (T). This allows each character of P to be compared with the right-hand neighbor characters (up to E characters) of its corresponding character of T. The lower E diagonals can be implemented in a way similar to the upper E diagonals, but here the shift operation is performed in the left-hand direction. This ensures that each character of P is compared with the left-hand neighbor characters (up to E characters) of its corresponding character of T.

We also build an efficient hardware architecture for each search window of the Shouji algorithm. It quickly finds the number of zeros in each 4-bit vector using a hardware look-up table that stores the 16 possible permutations of a 4-bit vector along with the number of zeros for each permutation. As presented in Figure 6.5, each counter counts the number of zeros in a single bit-vector. The counter takes four bits as input and generates three bits that represent the number of zeros within the window. Each counter requires three 4-input LUTs. In total, we need 6E+6 4-input LUTs to build a single search window. All bits of the counter output are generated at the same time, as the propagation delay through an FPGA look-up table is independent of the implemented function [135]. The comparator is responsible for selecting the 4-bit subsequence that maximizes the number of consecutive matches based on the output of each counter and the Shouji bit-vector. Finally, the selected 4-bit subsequence is stored in the Shouji bit-vector at the same corresponding location.
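To make the sliding-window scheme concrete, here is a minimal software sketch of the Shouji filtering loop. This is an illustration under our own assumptions, not the thesis implementation: the function name is ours, vacant bits are treated as ones, and the hardware tie-breaking between equally good diagonals is simplified to "first encountered wins".

```python
def shouji_filter(ref: str, read: str, e: int) -> bool:
    """Sketch of the Shouji sliding-window filter.

    Builds the 2E+1 diagonal bit-vectors (0 = pairwise match,
    1 = mismatch; vacant bits are set to 1), then slides a 4-column
    window over them, keeping in the Shouji bit-vector the 4-bit
    segment with the most zeros. Accepts iff the assembled bit-vector
    contains at least m - E zeros.
    """
    m = len(read)
    diagonals = [
        [0 if 0 <= i + d < m and read[i] == ref[i + d] else 1
         for i in range(m)]
        for d in range(-e, e + 1)
    ]

    def zeros(v):
        return v.count(0)

    shouji = [1] * m
    for i in range(max(m - 3, 1)):
        # Pick the 4-bit diagonal segment with the most zeros.
        best = max((diag[i:i + 4] for diag in diagonals), key=zeros)
        # Store it only if it improves the corresponding Shouji segment,
        # which avoids overestimating the number of matches.
        if zeros(best) > zeros(shouji[i:i + len(best)]):
            shouji[i:i + len(best)] = best
    return zeros(shouji) >= m - e
```

In this sketch, a pair with a single substitution passes for E ≥ 1, while two unrelated sequences are rejected.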
Our hardware implementation of the Shouji filtering unit is independent of the specific FPGA platform, as it does not rely on any vendor-specific computing elements (e.g., intellectual property cores).

We propose Shouji, a highly parallel and accurate pre-alignment filter designed on a specialized hardware platform. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all identical subsequences shared between two given sequences. This addresses the first two causes of filtering inaccuracy in SHD (i.e., random zeros and conservative counting). The second key idea is to avoid the filtering inaccuracy caused by the leading and trailing zeros (as we discuss in Chapter 3). Shouji replaces the vacant bits that result from shifting the read with ones.

Chapter 7
MAGNET: Accurate Hardware Pre-Alignment Filter

In this chapter, we introduce MAGNET, a pre-alignment filtering algorithm that achieves highly accurate filtering. MAGNET is a filtering heuristic that aims at finding all non-overlapping long streaks of consecutive zeros in the neighborhood map using a divide-and-conquer approach. We discuss the details of the MAGNET algorithm next.

MAGNET uses a divide-and-conquer technique to find all the E+1 identical subsequences, if any, and sums up their lengths. By calculating their total length, we can estimate the total number of edits between the two given sequences. If the total length of the E+1 identical subsequences is less than m-E, then there exist more identical subsequences than E+1, which are associated with more edits than allowed. If so, MAGNET rejects the two given sequences without performing the alignment step. The filtering strategy of MAGNET makes three observations based on the pigeonhole principle to examine each mapping accurately.
1. Given that the user-defined edit distance threshold, E, is usually less than 5% of the read length (m) [18, 65, 122, 62], the identical subsequences are usually long, ranging from a single pairwise match to m pairwise matches.

2. The length of the longest identical subsequence is strictly not less than ⌈(m − E)/(E + 1)⌉ and can be at most m pairwise matches long (i.e., equivalent to the sequence length). The upper bound is trivial and holds when the alignment is free of edits. The lower bound is met with equality when all edits are equispaced and all E+1 subsequences are of the same length. For example, for m = 100 and E = 5, the longest identical subsequence is at least ⌈95/6⌉ = 16 pairwise matches long.

3. We observe that the E+1 identical subsequences that are part of the correct alignment are always non-overlapping. We also observe that the 2E+1 diagonal bit-vectors of the neighborhood map contain other identical subsequences that are always short. This raises a fundamental question about whether the E+1 identical subsequences need to be strictly non-overlapping or only long and not necessarily non-overlapping. Next, we investigate both cases by providing two algorithms for finding the longest identical subsequences.

The MAGNET pre-alignment filter identifies the incorrect mappings, without calculating the optimal alignment, in three main steps. (1) It first constructs the neighborhood map (described in Chapter 6). (2) It then identifies all non-overlapping identical subsequences in the neighborhood map using a divide-and-conquer approach. (3) Finally, it makes a decision (it accepts or rejects the given sequences) based on the length of the found identical subsequences.

7.2.1 Method 1: Building the Neighborhood Map

The first step of the MAGNET algorithm is building our binary m-by-m neighborhood map, which we describe in Chapter 6. Similarly to the Shouji filter, the MAGNET filter starts with building the 2E+1 diagonal bit-vectors of the neighborhood map for the two given sequences.
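The construction of these diagonal bit-vectors can be sketched in Python as follows (an illustrative sketch; the function name and the dictionary layout are our own, and cells falling outside the reference are set to ones):

```python
def neighborhood_map(ref: str, read: str, e: int) -> dict:
    """Build the 2E+1 diagonal bit-vectors of the neighborhood map.

    Diagonal d (-e <= d <= +e) compares read[i] with ref[i + d]:
    0 marks a pairwise match, 1 marks a mismatch. Cells that fall
    outside the reference are treated as mismatches.
    """
    m = len(read)
    return {
        d: [0 if 0 <= i + d < m and read[i] == ref[i + d] else 1
            for i in range(m)]
        for d in range(-e, e + 1)
    }

# Streaks of consecutive zeros on a diagonal are the identical
# subsequences that MAGNET (and Shouji) search for.
diag = neighborhood_map("GGTGGAGAGATC", "GGTGAGAGTTGT", 3)
```

For this pair, the main diagonal (d = 0) starts with four zeros (the shared prefix GGTG), and diagonal d = +1 carries a five-zero streak covering the GAGAG stretch after the deleted base.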
We use the neighborhood map to represent all the pairwise matches between the read and the reference sequences. The neighborhood map also considers the presence of substitutions, insertions, and deletions. The next step is to find the consecutive matches that are part of the correct alignment.

7.2.2 Identifying the E+1 Identical Subsequences

There are two methods to identify the identical subsequences: either find the non-overlapping subsequences of consecutive zeros, or find the top E+1 longest subsequences of consecutive zeros. We describe both methods and show which one is more effective.

E+1 Non-overlapping Subsequences

Finding the E+1 non-overlapping subsequences in the neighborhood map involves three main steps.

1. Extraction. Each diagonal bit-vector nominates its local longest subsequence of consecutive zeros. Among all nominated subsequences, a single subsequence is selected as the global longest subsequence based on its length. Once found, MAGNET evaluates its length: if it is strictly less than ⌈(m − E)/(E + 1)⌉, then the two sequences contain more edits than allowed, which causes the identical subsequences to be shorter (i.e., each edit divides the sequence pair into more identical subsequences). If so, the two sequences are rejected. Otherwise, MAGNET stores its length to be used towards calculating the total length of all E+1 identical subsequences.

2. Encapsulation. The next step is essential to preserve the original edit (or edits) that causes a single identical sequence to be divided into smaller subsequences. MAGNET penalizes the found subsequence by two edits (one for each side). This is achieved by excluding from the search space of all bit-vectors the indices of the found subsequence in addition to the index of the surrounding single bit on both the left and right sides.

3. Divide-and-Conquer Recursion.
In order to locate the other E non-overlapping subsequences, MAGNET applies a divide-and-conquer technique, where we decompose the problem of finding the non-overlapping identical subsequences into two subproblems. While the first subproblem focuses on finding the next long subsequence located on the right-hand side of the subsequence found in the first extraction step, the second subproblem focuses on the other side of the found subsequence. Each subproblem is solved by recursively repeating all three steps mentioned above, but without evaluating the length of the longest subsequence again. MAGNET applies two early termination methods that aim at reducing the execution time of the filter. The first method is evaluating the length of the longest subsequence in the first recursion call. The second method is limiting the number of subsequences to be found to at most E+1, regardless of their actual number for each sequence pair.

E+1 Longest Subsequences

Alternatively, MAGNET can be changed to find only the top E+1 longest subsequences without the restriction that they be non-overlapping. This can be achieved by maintaining a binary max-heap priority queue [138] that stores the length of each of the E+1 subsequences from each mask (i.e., diagonal bit-vector). In total, the queue stores up to (E+1)·(2E+1) elements.

Edit Distance    False Accept Rate              Falsely Accepted Pairs
Threshold (E)    Non-overlapping  Top Longest   Non-overlapping  Any Longest
1                0.00%            0.00%         46               64
2                0.01%            0.09%         268              2674
3                0.04%            0.68%         1110             20504
4                0.08%            4.76%         2461             142906
5                0.18%            18.85%        5457             565463

Figure 7.1: The false accept rate of MAGNET using two different algorithms for identifying the identical subsequences. We observe that finding the E+1 non-overlapping identical subsequences leads to a significant reduction in the incorrectly accepted sequences compared to finding the top E+1 longest identical subsequences.
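The heap-based bookkeeping can be sketched with Python's heapq module (an illustrative sketch under our own naming; heapq implements a min-heap, so run lengths are negated to pop the longest run first):

```python
import heapq

def zero_runs(bits):
    """Lengths of the maximal runs of consecutive zeros in a bit-vector."""
    runs, length = [], 0
    for b in bits:
        if b == 0:
            length += 1
        else:
            if length:
                runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs

def top_longest_total(bitvectors, e):
    """Total length of the top E+1 zero runs over all bit-vectors.

    Each bit-vector contributes its E+1 longest runs, so the heap holds
    up to (E+1)*(2E+1) elements, matching the bound in the text.
    """
    heap = []
    for bits in bitvectors:
        for run in heapq.nlargest(e + 1, zero_runs(bits)):
            heapq.heappush(heap, -run)   # negate: heapq is a min-heap
    total = 0
    for _ in range(min(e + 1, len(heap))):
        total -= heapq.heappop(heap)     # pop the current longest run
    return total
```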
The length of the top longest subsequence is always stored at the root of the heap (the heap property). We need to extract the root of the heap structure E+1 times in order to find the total length of the top E+1 longest subsequences. We evaluate both algorithms for finding the longest identical subsequences in Figure 7.1. We observe that the accuracy of finding the top E+1 longest subsequences degrades exponentially as the edit distance threshold increases. We also observe that the accuracy of finding the non-overlapping subsequences remains almost linear and provides considerably more accurate filtering. Thus, we consider identifying only non-overlapping subsequences throughout the following sections.

The last step of MAGNET is to decide whether the mapping is potentially correct and needs to be examined by read alignment. After the termination, if the total length of all found identical subsequences is less than m-E, the two sequences are rejected. Otherwise, they are considered to be similar and the alignment can be measured using sophisticated alignment algorithms.

[Figure content: the 100 bp read and reference pair, the nine diagonal bit-vectors of the neighborhood map, and the resulting MAGNET bit-vector, with the five extracted subsequences numbered 1–5.]

Figure 7.2: An example of applying the MAGNET filtering algorithm with an edit distance threshold of 4. MAGNET finds all the longest non-overlapping subsequences of consecutive zeros in descending order of their length (as numbered in yellow).

We provide an example of applying the MAGNET filtering algorithm in Figure 7.2.

In this section, we analyze the asymptotic run time and space complexity of the MAGNET algorithm. We provide the pseudocode of MAGNET in Algorithm 7.1. We also provide the divide-and-conquer algorithm that MAGNET uses in Algorithm 7.2. MAGNET applies a divide-and-conquer technique that divides the problem of finding the identical subsequences into two subproblems in each recursion call. In the first recursion call, the extracted identical subsequence is of length at least a = ⌈(m − E)/(E + 1)⌉ bases. This reduces the problem of finding the identical subsequences from m to at most m-a, which is further divided into two subproblems: a left subproblem and a right subproblem. For the sake of simplicity, we assume that the sizes of the left and right subproblems decrease by factors of b and c, respectively, as follows:

m = a + 2 + m/b + m/c    (7.1)

The addition of 2 bases accounts for the encapsulation bits added at each recursion call.
Now, let T_MAGNET(m) be the time complexity of the MAGNET algorithm for identifying non-overlapping subsequences.

Algorithm 7.1: MAGNET
Input: Seq1, Seq2, edit distance threshold (E).
Output: 1 (similar) or 0 (dissimilar).
Pseudocode:
1:  m ← length(Seq1);
2:  // Build Neighborhood map (N)
3:  for i ← 1 to m do
4:      for j ← i-E to i+E do
5:          if Seq1[i] = Seq2[j] then
6:              N[i,j] ← 0;
7:          else
8:              N[i,j] ← 1;
9:  for i ← 1 to m do
10:     MAGNET[i] ← 1;
11: [MAGNET, calls] ← EXEN(N, 1, m, E, MAGNET, 1);
12: // Function CZ() returns the number of zeros
13: if CZ(MAGNET) ≥ m-E then return 1; else return 0;

If it takes O(km) time to find the global longest subsequence and divide the problem into two subproblems, where k = 2E+1 is the number of bit-vectors, we get the following recurrence equation:

T_MAGNET(m) = T_MAGNET(m/b) + T_MAGNET(m/c) + O(km)    (7.2)

Given that the early termination condition of the MAGNET algorithm restricts the recursion depth as follows:

Recursion tree depth = ⌈log(E + 1)⌉ − 1    (7.3)

we obtain

T_MAGNET(m) = O(km) · Σ_{x=0}^{⌈log(E+1)⌉−1} (1/b + 1/c)^x ≈ O(f · km)    (7.4)

where f is a fraction satisfying 1 ≤ f < 2. This in turn demonstrates that the MAGNET algorithm runs in linear time with respect to the sequence length and edit distance threshold, and hence it is computationally inexpensive. The space complexity of the MAGNET algorithm is as follows:

D_MAGNET(m) = D_MAGNET(m/b) + D_MAGNET(m/c) + O(km + m)
D_MAGNET(m) ≈ O(f · km + f · m)    (7.5)

Hence, the MAGNET algorithm requires linear space with respect to the read length and edit distance threshold. Next, we outline the hardware implementation details of the MAGNET filter.

In this section, we outline the challenges that are encountered in implementing the MAGNET filter in our accelerator design. Implementing the MAGNET algorithm is challenging due to the random location and variable length of each of the E+1 identical subsequences.
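In software, where variable-size indexing poses no problem, the recursive extract-and-encapsulate procedure can be sketched as follows. This is a hedged illustration: the names are ours, the input is the list of 2E+1 diagonal bit-vectors (0 = match, 1 = mismatch), and the tie-breaking rules of the real EXEN function are omitted.

```python
def magnet_filter(diagonals, m, e):
    """Divide-and-conquer sketch of MAGNET: extract up to E+1
    non-overlapping longest zero runs, encapsulating each found run
    with one penalty bit per side, and accept iff their total length
    is at least m - E."""
    state = {"total": 0, "calls": 0}

    def longest_run(lo, hi):
        """Start/end of the longest zero run in columns [lo, hi]."""
        best_len, best = 0, (0, -1)
        for bits in diagonals:
            run_start = lo
            for i in range(lo, hi + 2):
                if i > hi or bits[i] == 1:      # the run ends here
                    if i - run_start > best_len:
                        best_len, best = i - run_start, (run_start, i - 1)
                    run_start = i + 1
        return best

    def extract(lo, hi):
        if lo > hi or state["calls"] >= e + 1:  # early termination
            return
        state["calls"] += 1
        s, t = longest_run(lo, hi)
        if t < s:                               # no zeros left
            return
        state["total"] += t - s + 1
        extract(lo, s - 2)                      # left subproblem
        extract(t + 2, hi)                      # right subproblem

    extract(0, m - 1)
    return state["total"] >= m - e
```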
Verilog-2011 imposes two challenges on our architecture, as it does not support variable-size partial selection and indexing of a group of bits from a vector [139]. In particular, the first challenge lies in excluding the extracted identical subsequence along with its encapsulation bits from the search space of the next recursion call. The second challenge lies in dividing the problem into two subproblems, each of which has an unknown size at design time. To address these limitations and tackle the two design challenges, we keep the problem size fixed at each recursion call. We exclude the found longest subsequence from the search space by setting to '1' all bits of all 2E+1 bit-vectors that are located within the indices (locations) of the encapsulation bits. This ensures that the found longest subsequence and all its corresponding locations in the other bit-vectors are omitted from the following recursion calls.

Algorithm 7.2: EXEN function
Function: EXEN() extracts the top longest subsequence of consecutive zeros and generates two subproblems.
Input: Neighborhood map (N), start index (SI), end index (EI), E, MAGNET bit-vector, number of recursion calls.
Output: updated MAGNET bit-vector, updated number of calls.
Pseudocode:
1:  // Early termination condition
2:  if (SI ≤ EI and calls ≤ E+1) then
3:      // Function CCZ() returns the length and indices of the longest
4:      // subsequence of diagonally consecutive zeros
5:      for j ← 1 to E do  // Extraction
6:          [X, s1, e1] ← CCZ(N[SI+j, SI], EI);  // Lower diagonal
7:          [Y, s2, e2] ← CCZ(N[SI, SI+j], EI);  // Upper diagonal
8:          if X > Y then
9:              s ← s1; e ← e1;
10:         else
11:             s ← s2; e ← e2;
12:     [X, s1, e1] ← CCZ(N[SI, SI], EI);  // Main diagonal
13:     if X > (e-s+1) then
14:         s ← s1; e ← e1;
15:     // Early termination condition (only in the first call)
16:     if (calls = 1 and (e-s+1) < ⌈(m − E)/(E + 1)⌉) then
17:         return [MAGNET, 0];
18:     // Left subproblem with encapsulation
19:     [MAGNET, calls] ← EXEN(N, SI, s-2, E, MAGNET, calls+1);
20:     // Right subproblem with encapsulation
21:     [MAGNET, calls] ← EXEN(N, e+2, EI, E, MAGNET, calls+1);
22:     return [MAGNET, calls];
23: else return [MAGNET, calls-1];

We introduce MAGNET, a new filtering strategy that remarkably improves the accuracy of pre-alignment filtering and produces a minimal number of falsely accepted mappings. MAGNET gets rid of the first three causes of filtering inaccuracy that we observed in SHD [65] (see Chapter 3). We believe that MAGNET is the most accurate pre-alignment filter in the literature today, but this comes at the expense of hardware implementation challenges.

Chapter 8
SneakySnake: Fast, Accurate, and Cost-Effective Filter

In this chapter, we address the issue of the long execution time of read alignment using a different approach. The use of specialized hardware chips can yield significant performance improvements. However, the main drawbacks are the higher design effort, the limited data throughput, and the necessity of algorithm redesign and tailored implementation to fully leverage the FPGA architecture. To tackle these challenges and obviate these difficulties, we introduce a fast, accurate, and yet cost-effective pre-alignment filter. We call it SneakySnake.
SneakySnake leverages today's general-purpose processors (i.e., CPUs), which are largely available to bioinformaticians at no additional cost. Besides the CPU implementation of SneakySnake, we also provide an efficient hardware architecture to further accelerate the computations of the SneakySnake algorithm with high parallelism.

The key idea of the SneakySnake filter is that it only needs to know whether two sequences are dissimilar by more than the edit distance threshold. This fact leads to the following question: Can one approximate the edit distance between two sequences much faster than calculating the edit distance? Given a reference sequence A[1...m], a read sequence B[1...m], and an edit distance threshold E, SneakySnake calculates the approximate edit distance between A and B and then checks whether it is greater than the edit distance threshold; if so, the two sequences A and B are rejected. Otherwise, the two sequences A and B are accepted by SneakySnake and the optimal alignment is calculated. The approximate edit distance calculated by SneakySnake should always be less than or equal to the edit distance threshold as long as the actual edit distance does not exceed the edit distance threshold.

We convert the approximate edit distance problem to a restricted optimal path finding problem in a grid graph. We call this the Sneaky Snake Problem. It can be summarized as a traversal problem in a special grid graph, where a snake travels through the grid graph in the presence of randomly distributed obstacles. The goal is to find the path that connects the origin and the destination points with the minimal number of obstacles along the path. We describe the grid graph of the Sneaky Snake Problem as follows:

• There is a two-dimensional m-by-m grid graph, where m is the read length. The grid cells are aligned in rows and columns, and all cells have the same cost. The snake wants to travel through the grid from the top left corner towards the bottom right corner of the grid.
There are E+1 entrances next to the cells of the first column and, similarly, E+1 exits located next to the cells of the last column.

• The snake is allowed to travel through dedicated pipes. Each pipe is represented as a stretch of diagonally consecutive pairwise matches. We define the path in this problem as a sequence of diagonal grid cells. In other words, movement through the vertical cells is not counted towards the length of the path. The length of the path is simply the total number of diagonal cells along the path.

• The snake is only able to switch between pipes after skipping an obstacle (i.e., the end of a pipe). The obstacle represents a pairwise mismatch. We model the obstacle as a solid black grid cell. Traveling across an obstacle requires the snake to diagonally move one step ahead, and hence the obstacle cell is counted towards the total length of the path. The snake is allowed to avoid only E obstacles by transferring to another pipe or jumping over the obstacle within the same diagonal vector. This can be seen as the number of E tries required to arrive at the destination.

The general goal of this problem is to find a path in the grid graph with the minimum number of obstacles. The path starts at the first leftmost cell and travels out of the grid while exiting onto a destination. The SneakySnake filtering algorithm can approximate the edit distance using two different approaches. While one of them relies on what we call the weighted neighborhood maze, the other approach relies on the unweighted neighborhood maze. Both approaches involve three main steps.

Next, we describe the three steps of the first approach. (1) It constructs the weighted neighborhood maze. (2) It identifies the optimal travel path that has the least number of obstacles. (3) The mapping is accepted if and only if the snake survives the grid maze.
8.3.1 Method 1: Building the Weighted Neighborhood Maze

The first step in the SneakySnake algorithm is to build the m-by-m weighted neighborhood maze (WN). It is a modified version of our original neighborhood maze (see Chapter 6), where we change the meaning of the content of each cell. First, we change the value of a grid cell that represents a pairwise match from zero to the total number of consecutive zeros among its lower right neighbors within the same diagonal vector. Second, we change the value of a grid cell that represents a pairwise mismatch from one to zero (modeled as a solid black cell). Given i and j (where 1 ≤ i ≤ m and i-E ≤ j ≤ i+E), the entry WN[i, j] of the weighted neighborhood maze can be calculated as follows:

WN[i, j] = WN[i+1, j+1] + 1   if A[i] = B[j]
WN[i, j] = 0                  if A[i] ≠ B[j]    (8.1)

Computing each cell depends on its immediate lower right cell. This restricts the order of the computations to be performed starting from the lower right corner towards the upper left corner. We present in Figure 8.1 an example of a weighted neighborhood maze for two sequences, where sequence B differs from sequence A by three edits (i.e., obstacles).

The second step of the SneakySnake algorithm is to decide on which pipe the snake should travel, using three main steps. (1) The snake essentially uses the precomputed weight of each cell in the weighted neighborhood maze to make the decision. This is the key reason behind filling the weighted neighborhood maze from the lower right corner to the upper left corner, while the snake travels from the upper left corner towards the lower right corner.

[Figure content: the 12×12 maze grid with entrance and exit columns.]

Figure 8.1: Weighted neighborhood maze (WN) for reference sequence A = GGTGGAGAGATC and read sequence B = GGTGAGAGTTGT with E = 3.
The snake traverses the path that is highlighted in yellow, which includes three obstacles and three pipes, representing three identical subsequences: GGTG, AGAG, and T.

The snake examines the first cell of each diagonal vector in the first column of the grid and finds the cell with the largest weight. (2) If the snake is unable to decide which pipe to follow (i.e., all cells of the first column are obstacles), the snake skips the current column and consumes one attempt out of the E tries. The snake then finds the maximum weight of the cells in the next column. If it finds more than one pipe with the same length, it always chooses the first one (starting from the upper diagonal). (3) Once found, it travels through the pipe until it faces an obstacle. It repeats these three steps until it reaches its destination or consumes all the E tries.

The last step is to check whether the snake makes it through the grid maze. Given a limited number of tries (equal to E), if the snake arrives at the destination, then there exists a path that has at most E obstacles. In the context of approximate edit distance calculation, this means that there exists an alignment that has at most E edits. The SneakySnake algorithm considers this mapping a correct one and accepts it. Otherwise, it rejects the mapping without performing the read alignment step.

Here, we explain the three steps of solving the Sneaky Snake Problem using the unweighted neighborhood maze approach. (1) It first constructs the unweighted neighborhood maze. (2) It then identifies the optimal travel path that has the least number of obstacles. (3) Finally, the mapping is accepted if and only if the snake survives the grid maze without exceeding the limited number of tries.

In the unweighted neighborhood maze, we introduce two changes to the way we build the neighborhood maze.
First, each grid cell has no weight assigned to it. Second, we define a state for each grid cell, where a cell can be either in an available state or a blocked state. We set a grid cell to the available state if it represents a pairwise match. If the grid cell represents a pairwise mismatch, then we set its state to blocked. In this way, the unweighted neighborhood maze eliminates the data dependency between the grid cells that exists in the weighted neighborhood maze. Given i and j (where 1 ≤ i ≤ m and i-E ≤ j ≤ i+E), the state of the entry UN[i, j] of the unweighted neighborhood maze can be chosen as follows:

UN[i, j] = Available   if A[i] = B[j]
UN[i, j] = Blocked     if A[i] ≠ B[j]    (8.2)

[Figure content: the 12×12 maze grid with entrance and exit columns and the blocked cells shown in black.]

Figure 8.2: Unweighted neighborhood maze (UN) for reference sequence A = GGTGGAGAGATC and read sequence B = GGTGAGAGTTGT with E = 3. The snake traverses the path that is highlighted in yellow, which includes three obstacles and three pipes, representing three identical subsequences: GGTG, AGAG, and T.

We present in Figure 8.2 an example of an unweighted neighborhood maze for two sequences, where sequence B differs from sequence A by three edits. The state of the entry UN[i, j] is set to available if the i-th character of the read sequence matches the j-th character of the reference sequence. Otherwise, it is set to blocked (highlighted in black).

The second step of the SneakySnake algorithm is to decide which pipe the snake should consider, in three main steps. (1) As the snake has no prior information on the length of each pipe (unlike in the weighted neighborhood maze), it uses its telescoping lens to perform a depth-first search (DFS) [140] each time it faces a blocked cell (i.e., an obstacle). The DFS algorithm traverses the first upper diagonal vector to reach the maximum possible depth (i.e., until it arrives at a blocked cell or the end of the vector).
If the DFS algorithm reaches the exit of the maze, then SneakySnake terminates the DFS search. The depth of each grid cell is equal to the value of its j index. The algorithm stores the maximum possible depth and backtracks to the starting point (i.e., the obstacle cell). It then continues traversing the next unsearched diagonal vector and repeats the previous steps. (2) The snake selects the path with the largest depth. If it is unable to find one (i.e., all cells of the first column are obstacles), it skips the current column and consumes one path transfer out of the E allowed transfers. (3) If it finds multiple equal-length pipes, it always chooses the first one (starting from the upper diagonal). Once found, it travels through the pipe until it faces another obstacle. It repeats these three steps until it reaches its destination or consumes all the E tries.

The last step is to accept the mapping based on whether the snake arrives at its destination given a limited number of path transfers (equal to E). If it does, there is an alignment of the two given sequences that has at most E edits. Otherwise, the SneakySnake algorithm rejects the mapping without performing the read alignment step.

In this section, we analyze the asymptotic run time and space complexity of the SneakySnake algorithm. The SneakySnake algorithm builds the weighted neighborhood maze by traversing each vector starting from the lower right corner. It compares the corresponding characters of the two given sequences, and if they match, it updates the current cell with one plus the value of the previous cell. Assuming it takes O(m) time to build one diagonal vector, where m is the read length, it takes O(km) time to build the entire weighted maze, where k = 2E+1 is the number of grid vectors. Upon arriving at an obstacle, the SneakySnake algorithm examines the weight of each cell following the obstacle cell and picks the cell with the largest weight as a new starting point.
With E obstacles, this step is repeated at most E times, and each repetition takes O(k) time to compare the weights of k cells. Thus, the upper bound on the time complexity of SneakySnake using the weighted neighborhood maze is:

    T_SneakySnake^Weighted(m) = O(km + hk),        (8.3)

where h is a number satisfying 0 ≤ h ≤ E.

Using the unweighted neighborhood maze, the SneakySnake algorithm does not necessarily traverse all the cells of each diagonal vector (as in the weighted approach described above). Thus, its asymptotic run time is not fixed. On the one hand, the lower bound on its time complexity is O(m), which is achieved when the DFS algorithm reaches the exit of the maze without facing any obstacle along the path. On the other hand, the loose upper bound on its run time complexity is also O(km + hk), reached when the DFS traverses nearly the entire unweighted neighborhood maze. However, traversing the entire maze is unrealistic, as it would require each and every cell of the maze to equal '0' (for example, when all characters of the two sequences are 'A's); in that case, the DFS algorithm traverses only the first upper diagonal and then terminates. Thus, the run time complexity of SneakySnake with the unweighted neighborhood maze is:

    O(m) ≤ T_SneakySnake^Unweighted(m) < O(km + hk),        (8.4)

where h is a number satisfying 0 ≤ h ≤ E. This demonstrates that the unweighted neighborhood maze is asymptotically inexpensive compared to the weighted SneakySnake algorithm. Hence, we consider the unweighted SneakySnake algorithm for further analysis and evaluation. We provide the pseudocode of SneakySnake in Algorithm 8.1.
While the weighted SneakySnake approach requires no additional auxiliary space, the unweighted approach requires only storing the depth of at most 2E+1 vectors.

Algorithm 8.1: SneakySnake
Input: Seq1, Seq2, edit distance threshold (E).
Output: Accept (1) or Reject (0).
 1: m ← length(Seq1)
 2: // Build the unweighted neighborhood maze (UN)
 3: for i ← 1 to m do
 4:     for j ← i−E to i+E do
 5:         if Seq1[i] == Seq2[j] then UN[i, j] ← 0
 6:         else UN[i, j] ← 1
 7: // Find the path with the least number of obstacles
 8: i ← 1; e ← 0
 9: while i ≤ m and e ≤ E do
10:     LargestDepth ← 0
11:     for j ← 1 to 2E+1 do            // UN_Diagonal[diagonal id][cell index]
12:         Depth ← 0; ii ← 0
13:         while i+ii ≤ m and UN_Diagonal[j][i+ii] == 0 do
14:             Depth ← Depth + 1; ii ← ii + 1
15:         if Depth > LargestDepth then LargestDepth ← Depth
16:     i ← i + LargestDepth + 1
17:     e ← e + 1
18: // Examine the snake arrival
19: if i > m then return Accept
20: else return Reject

We introduce an FPGA-friendly architecture for the unweighted SneakySnake algorithm. Achieving an efficient hardware architecture raises the following question: Can one solve many small sub-problems of the Sneaky Snake Problem with high parallelism by reducing the search space of the SneakySnake algorithm? To this end, we divide the unweighted neighborhood maze column-wise into non-overlapping sub-matrices of t columns each (the height is always fixed to 2E+1 rows), for some parameter t. We then apply the SneakySnake algorithm to each sub-matrix independently of the other sub-matrices. This yields three key benefits. First, downsizing the search space into a reasonably small sub-matrix with dimensions known at design time limits the number of possible solutions for that sub-matrix. This reduces the size of the look-up tables (LUTs) required to build the architecture and simplifies the overall design. Second, this approach helps to maintain a modular and scalable architecture that can be implemented for any read length and edit distance threshold. Third, all the smaller sub-problems can be solved independently and rapidly with high parallelism. This reduces the execution time of the overall algorithm, as the SneakySnake algorithm does not need to evaluate the entire path.
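Algorithm 8.1 can be turned into a short runnable sketch. The Python version below is an illustration rather than the thesis's optimized implementation: it uses 0-based indexing, encodes the available/blocked test of Equation 8.2 inline in `blocked()`, treats out-of-range diagonal cells as obstacles, and counts one edit per obstacle crossing.

```python
def sneaky_snake(read, ref, E):
    """Decide the Sneaky Snake Problem on the unweighted neighborhood
    maze: greedily take the longest run of matches (pipe) over the
    2E+1 diagonals, crossing one obstacle (edit) per iteration."""
    m = len(read)

    def blocked(i, j):
        # Equation 8.2: blocked on mismatch or when j leaves the reference.
        return not (0 <= j < len(ref)) or read[i] != ref[j]

    i, edits = 0, 0
    while edits <= E:
        largest = 0
        for offset in range(-E, E + 1):          # the 2E+1 diagonals
            depth = 0
            while i + depth < m and not blocked(i + depth, i + depth + offset):
                depth += 1
            largest = max(largest, depth)
        if i + largest >= m:                     # the pipe reaches the exit
            return True
        i += largest + 1                         # jump over one obstacle
        edits += 1
    return False                                 # ran out of allowed edits

# Figure 8.2 example: the snake needs exactly three obstacle crossings,
# so the pair is accepted for E = 3 but rejected for E = 2.
assert sneaky_snake("GGTGAGAGTTGT", "GGTGGAGAGATC", 3) is True
assert sneaky_snake("GGTGAGAGTTGT", "GGTGGAGAGATC", 2) is False
```

On the Figure 8.2 pair, the greedy traversal finds the pipes GGTG, AGAG, and T, separated by three obstacles, matching the figure.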
However, these benefits come at the cost of accuracy degradation. The solution for each sub-problem is not necessarily part of the solution for the main problem (with the original size of m by m). Though our hardware architecture is guaranteed to always choose the path with the fewest obstacles within its search space, it can underestimate the number of obstacles found and thereby increase the false accept rate.

Next, we present the details of our hardware architecture. We choose the parameter t to be 8 columns. This results in partitioning the unweighted neighborhood maze of size 100 by 100 (or 2E+1 by 100 computed cells) into 13 sub-matrices, each of size 2E+1 by 8. Each row in the sub-matrix is part of a diagonal vector of the unweighted neighborhood maze. Each sub-matrix represents an individual Sneaky Snake Problem. Solving each problem requires determining the optimal path (with the least number of obstacles) along each sub-matrix using five main steps. (1) The first step is to perform the DFS for finding the optimal path along the 2E+1 rows of the sub-matrix. We implement the DFS algorithm on FPGA as a leading-zero counter (LZC). We use the LZC design proposed in [141]. It counts the number of consecutive zeros that appear in an n-bit input word before the first more significant bit that is equal to one. It generates two output signals, the log2(n) bits of the leading-zero count C and a flag Valid, for an input word A = A_{n−1}, A_{n−2}, ..., A_0, where A_{n−1} is the most significant bit.

Figure 8.3: Proposed 4-bit LZC comparator.
When all input bits are set to zero, the Valid flag is set to one. For all other inputs, the value of C represents the number of leading zeros. For the case of an 8-bit input operand, the output signals of the LZC are given by the following, where NOT(·) denotes the Boolean complement, + denotes OR, and · denotes AND:

    C2 = NOT(A0 + A1 + A2 + A3)
    C1 = NOT(A0 + A1) · (A2 + A3 + NOT(A2 + A3 + A4 + A5) · (A6 + A7))
    C0 = NOT(A0) · (A1 + NOT(A2) · (A3 + NOT(A4) · (A5 + NOT(A6) · A7)))
    Valid = NOT(A0 + A1 + A2 + A3 + A4 + A5 + A6 + A7)        (8.5)

(2) The second step is to find the row that has the largest number of leading zeros. We build a hierarchical comparator structure with log2(2E+1) levels. We use a single LZC comparator for comparing the numbers of leading zeros of two rows. The LZC comparator compares the two leading-zero counts produced by two LZC circuits and passes the larger one as its output signals without changing their values (maintaining the same meaning). We provide in Figure 8.3 the proposed architecture of the 4-bit LZC comparator. For an edit distance threshold of 5 bp, we need 12 LZC comparators arranged into 4 levels (6, 3, 2, and 1 LZC comparators per level, respectively). We provide the overall architecture of the 4-level LZC comparator tree, including the 11 LZC block diagrams, in Figure 8.4.

Figure 8.4: Block diagram of the 11 LZCs and the hierarchical LZC comparator tree for computing the largest number of leading zeros in 11 rows.

(3) The third step is to store the largest number of leading zeros and shift the bits of all rows to the right. As we implement the DFS as an LZC circuit, we always need to start the DFS search from the least significant bit of each row. The shift operation is necessary for deleting the leading zeros and allowing the snake to continue finding the optimal path. We shift each row of the sub-matrix by x+1 bits, where x is the largest number of leading zeros found in step 2.
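The LZC behavior and the shift-by-x+1 step can be checked with a small Python sketch. The gate-level forms of C2, C1, C0, and Valid below are a reconstruction (the complement bars of Equation 8.5 did not survive typesetting), so the sketch verifies them exhaustively against a plain zero-counting loop; the count scans from the least significant bit A0, as the shift step in the text implies.

```python
def lzc_reference(a):
    """Count consecutive zeros from the LSB up to the first one.
    Valid is raised only when the whole 8-bit word is zero."""
    if a == 0:
        return None, 1
    count = 0
    while (a >> count) & 1 == 0:
        count += 1
    return count, 0

def lzc_equations(a):
    """Reconstructed gate-level form of Equation 8.5 for an 8-bit word."""
    A = [(a >> i) & 1 for i in range(8)]
    NOT = lambda x: 1 - x
    c2 = NOT(A[0] | A[1] | A[2] | A[3])
    c1 = NOT(A[0] | A[1]) & (A[2] | A[3]
         | (NOT(A[2] | A[3] | A[4] | A[5]) & (A[6] | A[7])))
    c0 = NOT(A[0]) & (A[1] | (NOT(A[2]) & (A[3]
         | (NOT(A[4]) & (A[5] | (NOT(A[6]) & A[7]))))))
    valid = NOT(A[0] | A[1] | A[2] | A[3] | A[4] | A[5] | A[6] | A[7])
    return 4 * c2 + 2 * c1 + c0, valid

# The equations agree with the reference counter on all 256 inputs.
for a in range(256):
    count, valid = lzc_reference(a)
    c, v = lzc_equations(a)
    assert v == valid and (valid or c == count)

def consume(row, x, t=8):
    """Step 3: shift right by x+1 bits, filling the vacated MSBs with
    ones, which removes the found pipe and the obstacle that ends it."""
    mask = (1 << t) - 1
    return ((row >> (x + 1)) | (mask << (t - x - 1))) & mask

# Two rounds on row 00010000: x = 4, then x = 3, row fully consumed.
row = 0b00010000
x1, _ = lzc_equations(row)            # 4 leading zeros
row = consume(row, x1)                # -> 0b11111000
x2, _ = lzc_equations(row)            # 3 leading zeros
assert (x1, x2, consume(row, x2)) == (4, 3, 0b11111111)
```

The exhaustive check is the point of the sketch: if the reconstructed Boolean forms disagreed with the zero-counting semantics anywhere, one of the 256 assertions would fail.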
This guarantees that the found path is excluded from the next search round, together with the edit that divides the optimal path into shorter paths.

(4) The fourth step is to repeat the previous three steps in order to find the optimal path from the least significant bit all the way to the most significant bit. The number of replications needed depends on the desired accuracy of the SneakySnake algorithm. If our target is to find at most a single edit within each sub-matrix, then we need to build two replications of the steps described above. For example, let A be 00010000, where t = 8. The first replication computes the value of x as four zeros and updates the bits of A to 11111000. The second replication computes the value of x as three zeros and updates the bits of A to 11111111.

(5) The last step is to calculate the total number of obstacles along the entire optimal path from the least significant bit towards the most significant bit. We first find the number of replications that produce at least a single leading zero (i.e., x > 0). If it equals the total number of replications (y), then the number of obstacles (edits) equals the total number of replications included in the design. Otherwise, we compute the number of obstacles as follows:

    min( y, t − Σ_{o=1..y} x_o ),        (8.6)

where y is the total number of replications involved in the design and x_o is the largest number of leading zeros produced by the replication of index o.

8.7 Discussion

We introduce the SneakySnake algorithm and hardware architecture in order to address the question that we raise earlier in this chapter about whether approximating the edit distance calculation can be faster than calculating the exact edit distance. The accuracy of the edit distance approximation is another concern. Our SneakySnake filter always underestimates the total number of edits. This is mainly due to the goal of the Sneaky Snake Problem, which is the search for the optimal path with the least number of obstacles (or edits).
This concern raises two key questions. (1) Can one improve the accuracy of the edit distance approximation such that we achieve either a new optimal read aligner or a highly-accurate pre-alignment filter? The SneakySnake algorithm can be slightly modified such that it applies a penalty to the selected path. If the snake selects the first upper diagonal, then the obstacle is in fact a deleted character at the i-th index of the read sequence. Similarly, if the current path is on the main diagonal and the next selected path is on the second upper diagonal, then the obstacle corresponds to two deleted characters. The current implementation of SneakySnake considers these two deleted characters as a single edit. This accuracy improvement of SneakySnake is yet to be explored in this thesis.

The second question is: (2) Similarly to the way we partition the search space of the SneakySnake algorithm, can one reduce the search space of exact edit distance algorithms such that only the necessary cells are computed? We observe that exact edit distance algorithms explore a large area of the dynamic programming matrix (even with a banded matrix), which is unnecessary for highly dissimilar sequences. The edit distance is considered a non-additive distance measure [30]. This means that its calculation cannot be distributed over concatenated subsequences of a long sequence. In other words, we cannot divide the read-reference pair into shorter pairs, calculate the exact edit distance for each short read-reference pair individually (aiming at computing them concurrently), and then accumulate the results. For example, take A = "CGCG" and B = "GCGC", and observe that calculating the edit distance for each character of A with its corresponding character of B yields four edits, while the edit distance between A and B, each as a whole, is only 2 edits. This is clear from the way the dynamic programming matrix is computed.
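The CGCG/GCGC example can be checked with a textbook Levenshtein implementation (a generic sketch, not the thesis's code): summing per-chunk distances overestimates the true edit distance, which is why the calculation cannot simply be split over subsequences and accumulated.

```python
def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]

A, B = "CGCG", "GCGC"
# Character-by-character (chunk size 1): every position mismatches.
chunked = sum(edit_distance(x, y) for x, y in zip(A, B))
whole = edit_distance(A, B)
assert chunked == 4 and whole == 2    # non-additivity: 4 != 2
```

The whole-sequence distance is 2 because one deletion at the front of A and one insertion at its end transform CGCG into GCGC, an alignment the chunk-wise computation can never see.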
The computations always depend on the prefixes of both the read and the reference sequences. As a workaround, we can compute the exact edit distance in an incremental fashion. One can divide the dynamic programming matrix into sub-matrices, where each smaller sub-matrix is fully covered by all the larger sub-matrices, such that each sub-matrix UN_s strictly starts from the index i of value 1 up to some s, where 1 ≤ i ≤ s ≤ m and i−E ≤ j ≤ i+E hold. We refer to this edit distance measure as a prefix edit distance. In the next chapter, we evaluate all these scenarios in detail.

In this chapter, we introduce the Sneaky Snake Problem, and we show how an approximate edit distance problem can be converted to an instance of the Sneaky Snake Problem. Subsequently, we propose a new pre-alignment filtering algorithm (we call it SneakySnake) that obviates the need for expensive specialized hardware. The solution we provide is cost-effective in a limited-resources environment. Our algorithm does not exploit any SIMD-enabled CPU instructions or vendor-specific processors, which makes it superior and attractive. We also provide an efficient and scalable hardware architecture along with several design optimizations for the SneakySnake algorithm. Finally, we discuss several optimizations and challenges of accelerating both approximate and exact edit distance calculations.

Chapter 9

Evaluation

In this chapter, we evaluate the FPGA resource utilization, the filtering accuracy, and the memory utilization of all our proposed pre-alignment filters. We also investigate the benefits of using our hardware and CPU-based pre-alignment filtering solutions along with state-of-the-art aligners. We compare the performance of our proposed pre-alignment filters (SneakySnake, MAGNET, Shouji, and GateKeeper) with the state-of-the-art existing pre-alignment filter, SHD [65], and read aligners, Edlib [78], Parasail [54], GSWABE [57], CUDASW++ 3.0 [59], and FPGASW [53].
We run all experiments on a 3.6 GHz Intel i7-3820 CPU with 8 GB RAM. We use a Xilinx Virtex 7 VC709 board [133] to implement our accelerator architecture and our hardware filters. We build the FPGA designs using Vivado 2015.4 in synthesizable Verilog.

Our experimental evaluation uses 12 different real datasets. Each dataset contains 30 million real sequence pairs. Next, we elaborate on how we obtain them.

Table 9.1: Benchmark Illumina-like read sets of the whole human genome, obtained from EMBL-ENA.

    Accession no.       ERR240727_1            SRR826460_1            SRR826471_1
    Read length (bp)    100                    150                    250
    HTS                 Illumina HiSeq 2000    Illumina HiSeq 2000    Illumina HiSeq 2000

Table 9.2: Benchmark Illumina-like datasets (read-reference pairs). We map each read set, described in Table 9.1, to the human reference genome in order to generate four datasets per read set using different edit distance thresholds of the mapper (the -e parameter of mrFAST).

We use different values of the -e parameter of mrFAST to generate four real datasets from each read set. We summarize the details of these 12 datasets in Table 9.2. For convenience of reference, we number them from 1 to 12 (e.g., set 1 represents 30 million reads from ERR240727_1 mapped with -e = 2 edits). The 12 real datasets enable us to measure the effectiveness of the filters in tolerating both a low number of edits and far more edits than the allowed edit distance threshold. We provide detailed information on the number of correct and incorrect pairs in each of the 12 datasets for different user-defined edit distance thresholds in Table A.1, Table A.2, and Table A.3 in Appendix A.

9.2 Resource Analysis

We now examine the FPGA resource utilization of the hardware implementations of the GateKeeper, Shouji, MAGNET, and SneakySnake pre-alignment filters.
We provide several hardware designs for two commonly used edit distance thresholds, 2 bp and 5 bp, as reported in [18, 65, 122, 62], for a sequence length of 100 bp. The VC709 FPGA chip contains 433,200 slice LUTs (look-up tables) and 866,400 slice registers (flip-flops). Table 9.3 lists the FPGA resource utilization for a single filtering unit. We make five main observations.

1. The design for a single MAGNET filtering unit requires about 10.5% and 37.8% of the available LUTs for edit distance thresholds of 2 bp and 5 bp, respectively. Hence, MAGNET can process 8 and 2 sequence pairs concurrently for edit distance thresholds of 2 bp and 5 bp, respectively, without violating the timing constraints of our hardware accelerator.

2. The design for a single Shouji filtering unit requires about 15x-21.9x fewer LUTs compared to MAGNET. This enables Shouji to achieve more parallelism than MAGNET, as it can have 16 filtering units within the same FPGA chip.

3. GateKeeper requires about 26.9x-53x and 1.7x-2.4x fewer LUTs compared to MAGNET and Shouji, respectively. GateKeeper can also examine up to 16 sequence pairs at the same time.

4. SneakySnake requires 15.4x-26.6x fewer LUTs compared to MAGNET. While SneakySnake requires slightly fewer LUTs compared to Shouji, it requires about 2x more LUTs compared to GateKeeper. SneakySnake can also examine up to 16 sequence pairs concurrently.

5. We observe that the hardware implementations of Shouji, MAGNET, and SneakySnake require pipelining the design (i.e., shortening the critical path delay of each processing core by dividing it into stages or smaller tasks) to meet the timing constraints and achieve more parallelism. We build 8 pipeline stages for Shouji, 22 pipeline stages for MAGNET, and 5 pipeline stages for SneakySnake to satisfy the timing constraints. However, pipelining the design comes at the expense of increased register utilization.

Table 9.3: FPGA resource usage (slice LUTs, slice registers, and number of filtering units) for a single filtering unit of Shouji, MAGNET, GateKeeper, and SneakySnake, for a sequence length of 100 bp and under different edit distance thresholds (E).

We conclude that the FPGA resource usage is correlated with the filtering accuracy. For example, the least accurate filter, GateKeeper, occupies the fewest FPGA resources. We also conclude that the lower the logic utilization of a single filtering unit, the larger the number of filtering units that can be integrated into the FPGA.

Next, we assess the false accept rate and false reject rate of GateKeeper, Shouji, MAGNET, and SneakySnake across our 12 datasets. We also investigate and address several concerns that we raise in Chapter 8. We compare the accuracy of our proposed pre-alignment filters with the best performing existing pre-alignment filter, SHD [65].

SHD supports a sequence length of up to only 128 characters (due to the SIMD register size). To ensure as fair a comparison as possible, we allow SHD to divide long sequences into batches of 128 characters, examine each batch individually, and then sum up the results. As we describe in Chapter 3, we aim to minimize the false accept rate so that the elimination of dissimilar sequences is maximized. We also aim to maintain a 0% false reject rate. We use Edlib [78] (set to edit distance mode) to generate the ground-truth edit distance value for each sequence pair, as it has a zero false accept rate and a zero false reject rate.

We raise two key questions in Chapter 8: (1) Can one approximate the edit distance between two sequences much faster than calculating the exact edit distance? (2) Can one reduce the search space of approximate and exact edit distance algorithms?
To answer these two questions, we first evaluate the performance of SneakySnake (an approximate edit distance algorithm) and then examine the feasibility of reducing its search space without causing falsely-rejected mappings. Second, we examine the ability to modify the best performing existing exact edit distance algorithm, Edlib [78], such that it calculates the prefix edit distance.

As we discuss in Chapter 8, we column-wise partition the unweighted neighborhood maze of the SneakySnake algorithm into adjacent non-overlapping sub-matrices of the same size (2E+1 by t). In Figure 9.1, we illustrate the effects of this partitioning on the false accept rate and the execution time of our SneakySnake algorithm. We make two observations.

1. Partitioning the search space of SneakySnake with a partition size of 5 (t=5) reduces its execution time by up to 5.12x (Figure 9.1a), 7.4x (Figure 9.1c), and 13.2x (Figure 9.1e), at the expense of increasing the false accept rate by up to 55.4x (Figure 9.1b), 43.5x (Figure 9.1d), and 67.3x (Figure 9.1f).

2. There is a trade-off between the speed and the accuracy of the SneakySnake algorithm. For example, the least accurate configuration, SneakySnake with a partition size of 5 columns (SneakySnake-5 for short), yields the fastest speed.

Next, we assess the effect of the number of replications on the filtering accuracy of the hardware implementation of the SneakySnake algorithm. We use a sub-matrix width of 8 columns (t=8) and vary the height of the sub-matrix from 1 row (i.e., E=0 bp) up to 21 rows (i.e., E=10 bp). Based on Figure 9.2, we make two observations:

1. Increasing the number of replications in the design improves the filtering accuracy of the SneakySnake algorithm. This observation is in accord with our expectation, as each replication detects at most a single edit within each sub-matrix.
2. The hardware implementation of SneakySnake using 3 replications (3 iterations for finding the optimal path within each sub-matrix) achieves similar (or slightly better) accuracy as SneakySnake-5.

We conclude that partitioning the search space of the SneakySnake algorithm is also beneficial for building an efficient hardware architecture while maintaining high filtering accuracy.

Now, we modify the Edlib algorithm such that it applies a prefix edit distance (we provide the definition in Chapter 8). It starts by computing only a small square sub-matrix. If the computed edit distance meets the user-defined edit distance threshold, it extends the sub-matrix into a larger square one (which overlaps entirely with the smaller one) by increasing the number of columns and rows by a constant. For example, if the initial sub-matrix size is 5 columns by 5 rows, then we extend it into a larger one of size 10 columns by 10 rows, and next into a sub-matrix of size 15 columns by 15 rows.

Figure 9.1: The effects of column-wise partitioning the search space of the SneakySnake algorithm on the average false accept rate ((a), (c), and (e)) and the average execution time ((b), (d), and (f)) of examining set 1 to set 4 in (a) and (b), set 5 to set 8 in (c) and (d), and set 9 to set 12 in (e) and (f). Besides the default size (equal to the read length) of SneakySnake's unweighted neighborhood maze, we choose partition sizes (the number of the grid's columns included in each partition) of 5, 10, 25, and 50 columns.

Figure 9.2: The effects of the number of replications in the hardware implementation of the SneakySnake algorithm on its filtering accuracy (false accept rate). We use a wide range of edit distance thresholds (0 bp-10 bp for a read length of 100 bp) and four datasets: (a) set 1, (b) set 2, (c) set 3, and (d) set 4.

We keep extending the size of the sub-matrix until we cover the entire dynamic programming matrix or the prefix edit distance exceeds the edit distance threshold. In Figure 9.3, we present the effects of computing the prefix edit distance (partitioning the edit distance matrix) on the overall execution time.
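The extend-until-threshold scheme can be sketched as follows. This is an illustrative sketch, not Edlib's actual API: `edit_distance` stands in for any exact edit distance routine, and the loop recomputes growing square prefixes (5x5, 10x10, ...) until the whole matrix is covered or the prefix edit distance exceeds the threshold E.

```python
def edit_distance(a, b):
    """Plain DP Levenshtein distance (stand-in for an exact aligner)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def prefix_filtered_distance(read, ref, E, step=5):
    """Compute the edit distance on growing square prefixes; stop early
    once the prefix edit distance already exceeds the threshold E."""
    m = len(read)
    for s in range(step, m + step, step):
        s = min(s, m)
        d = edit_distance(read[:s], ref[:s])
        if d > E:
            return None          # early rejection: prefix already too distant
        if s == m:
            return d             # covered the entire matrix
```

For dissimilar pairs the first small prefix usually exceeds E, so most of the dynamic programming matrix is never computed; for similar pairs the scheme pays the extra cost of recomputing each enlarged sub-matrix.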
We also provide the performance of the original and the partitioned SneakySnake algorithms for a better comparison. We provide more detailed results in Table A.4, Table A.5, and Table A.6 in Appendix A. We make two key observations.

1. Our SneakySnake algorithm with a partition size of 5 (SneakySnake-5) is up to 25.5x (Figure 9.3a), 52.5x (Figure 9.3b), and 94.5x (Figure 9.3c) faster than the best performing edit distance algorithm, Edlib [78].

2. Prefix edit distance with a large enough initial sub-matrix size provides a slight reduction in the execution time of Edlib. We observe that setting the initial sub-matrix size to 50 bp provides the highest reduction in Edlib's execution time over a wide range of edit distance thresholds. Edlib with an initial sub-matrix size of 50 bp (Edlib-50 for short) is up to 1.9x (Figure 9.3a), 4.4x (Figure 9.3b), and 7.8x (Figure 9.3c) faster than the original Edlib. However, SneakySnake-5 is still up to an order of magnitude (19.8x) faster than Edlib-50 when the edit distance threshold is set to 9 (for m=100) or 10 (for m=150 or 250) and below, as highlighted with a dashed vertical line in Figure 9.3. Note that E is typically less than or equal to 5% of the read length [18, 65, 122, 62].

We conclude that reducing the search space of both approximate and exact edit distance algorithms is beneficial. Combining SneakySnake-5 and Edlib-50 can provide a fast examination of incorrect mappings across a wide range of edit distance thresholds (0%-10% of the read length).

We present in Figure 9.4, Figure 9.5, and Figure 9.6 the false accept rate of our pre-alignment filters compared to SHD for read lengths of 100 bp, 150 bp, and 250 bp, respectively. We make six key observations.

1. Shouji, MAGNET, GateKeeper, and SneakySnake are less accurate in examining low-edit sequences (Figures 9.4a,b, 9.5a,b, and 9.6a,b) than edit-rich sequences (Figures 9.4c,d, 9.5c,d, and 9.6c,d).
While the SneakySnake pre-alignment filter yields the highest accuracy, SHD [65] and GateKeeper provide the lowest accuracy compared to all other pre-alignment filters. The slope of the MAGNET plot is almost comparable to that of SneakySnake (with the default maze size).

Figure 9.3: The effects of computing the prefix edit distance on the overall execution time of the edit distance calculations, compared to the original Edlib (exact edit distance) and our partitioned implementation of the SneakySnake algorithm. We present the average time spent in examining set 1 to set 4 in (a), set 5 to set 8 in (b), and set 9 to set 12 in (c). We choose initial sub-matrix sizes of 5, 10, 25, and 50 columns. We mark the intersection of the SneakySnake-5 and Edlib-50 plots with a dashed vertical line.

2. GateKeeper and SHD [65] become ineffective for edit distance thresholds greater than 8% for m = 100 bp (Figure 9.4), 5% for m = 150 bp (Figure 9.5), and 3% for m = 250 bp (Figure 9.6), for both low-edit and edit-rich sequences, where m is the read length. This leads to examining each sequence pair twice unnecessarily, by both GateKeeper or SHD and the full alignment step.

3. Shouji provides up to 17.2x, 73x, and 467x lower false accept rates compared to GateKeeper and SHD for read lengths of 100 bp, 150 bp, and 250 bp, respectively.
4. MAGNET, SneakySnake (with the default maze size), and SneakySnake-5 show a slow exponential degradation in their filtering accuracy for low-edit sequences, with false accept rates of up to 50%, 30%, and 70%, respectively, as we show in Figures 9.4a,b, 9.5a,b, and 9.6a,b. They also show an almost linear growth in their false accept rate, which stays below 4%, for edit-rich sequences of different read lengths, as we show in Figures 9.4c,d, 9.5c,d, and 9.6c,d.

5. MAGNET shows up to 1577x, 3550x, and 25552x lower false accept rates compared to GateKeeper and SHD for read lengths of 100 bp, 150 bp, and 250 bp, respectively. MAGNET also provides up to 205x, 951x, and 16760x lower false accept rates compared to Shouji for read lengths of 100 bp, 150 bp, and 250 bp, respectively.
SneakySnake also shows up to 64.1x, 16x,and 22x less false accept rate compared to MAGNET for read lengths of100 bp, 150 bp, and 250 bp, respectively.We conclude that SneakySnake, MAGNET, and Shouji are very effective andsuperior to the state-of-the-art pre-alignment filter, SHD [65] in both situations100 F a l s e A cc e p t R a t e Edit Distance ThresholdSHDGateKeeperShoujiMAGNETSneakySnake-100SneakySnake-5 00.20.40.60.81 0 1 2 3 4 5 6 7 8 9 10 F a l s e A cc e p t R a t e Edit Distance ThresholdSHDGateKeeperShoujiMAGNETSneakySnake-100SneakySnake-500.20.40.60.81 0 1 2 3 4 5 6 7 8 9 10 F a l s e A cc e p t R a t e Edit Distance ThresholdSHDGateKeeperShoujiMAGNETSneakySnake-100SneakySnake-500.20.40.60.81 0 1 2 3 4 5 6 7 8 9 10 F a l s e A cc e p t R a t e Edit Distance ThresholdSHDGateKeeperShoujiMAGNETSneakySnake-100SneakySnake-5 (a) (b)(c) (d) Figure 9.4: The false accept rate produced by our pre-alignment filters, Gate-Keeper, Shouji, MAGNET, and SneakySnake, compared to the best performingfilter, SHD [65]. We use a wide range of edit distance thresholds (0-10 edits fora read length of 100 bp) and four datasets: (a) set 1, (b) set 2, (c) set 3, and (d)set 4.(low-edit and edit-rich mappings). They maintain a very low rate of falsely-accepted incorrect mappings and significantly improves the accuracy of pre-alignment filtering by up to five orders of magnitude compared to GateKeeperand SHD. 
Using our 12 low-edit and edit-rich datasets for three different read lengths, we observe that SneakySnake (for all partition sizes), Shouji, GateKeeper, and SHD do not filter out correct mappings; hence, they provide a 0% false reject rate. The reason behind this is the way we find the identical subsequences.

Figure 9.5: The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-15 edits for a read length of 150 bp) and four datasets: (a) set 5, (b) set 6, (c) set 7, and (d) set 8.

Figure 9.6: The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing filter, SHD [65]. We use a wide range of edit distance thresholds (0-25 edits for a read length of 250 bp) and four datasets: (a) set 9, (b) set 10, (c) set 11, and (d) set 12.

Figure 9.7: An example of a falsely-rejected mapping using the MAGNET algorithm for an edit distance threshold of 6. The figure shows the read, the reference, the thirteen diagonal bit-vectors, and the resulting MAGNET bit-vector. The random zeros (highlighted in red) confuse the MAGNET filter, causing it to select shorter segments of random zeros instead of a longer identical subsequence (highlighted in blue).

We aim to find the subsequences that have the largest number of zeros, such that we maximize the number of matches and minimize the number of edits that cause the division of one long identical sequence into shorter subsequences. However, this is not the case for MAGNET. We observe that MAGNET shows a very low, but nonzero, false reject rate of less than 0.00045% for an edit distance threshold of at least 4 bp. This is due in large part to the greedy choice of always selecting the longest identical subsequences in descending order of length, regardless of their source (i.e., the diagonal bit-vectors that originate them). We provide in Figure 9.7 an example of where MAGNET fails to maintain the correct mappings. A potential solution is to relate the number of encapsulated bits to the source of each segment. For instance, if any extracted segment in the MAGNET mask has a different source than its neighboring segments, then we need to penalize that segment by adding more encapsulation bits. This can also help us to produce an accurate CIGAR string for each mapping. With the help of an auxiliary data structure, we can keep track of the source of each extracted segment at each position. While matches coming from the "upper diagonal-1" indicate a single deletion, matches coming from the "upper diagonal-2" indicate two deletions. A detailed algorithm for this is beyond the scope of this thesis and is part of our future work.
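The greedy extraction step described above can be sketched in a few lines. The simplified Python model below (our own illustration, not the thesis implementation) repeatedly takes the longest run of zeros across all diagonal bit-vectors while ignoring which diagonal each run comes from; this source-blindness is what lets runs of "random zeros" on a wrong diagonal occasionally outcompete a true identical segment, as in Figure 9.7.

```python
# Simplified model of MAGNET's greedy segment extraction (illustrative only).

def longest_zero_run(bitvector, used):
    """Longest run of '0's in `bitvector` that avoids already-used columns.
    Returns (length, start)."""
    best = (0, 0)
    start = None
    for i, (b, u) in enumerate(zip(bitvector, used)):
        if b == '0' and not u:
            if start is None:
                start = i
            if i - start + 1 > best[0]:
                best = (i - start + 1, start)
        else:
            start = None
    return best

def magnet_greedy(diagonals, num_segments):
    """Extract `num_segments` segments greedily across all diagonal
    bit-vectors, ignoring each segment's source diagonal."""
    used = [False] * len(diagonals[0])
    picks = []
    for _ in range(num_segments):
        # Source-blind choice: the best run over ALL diagonals wins.
        length, start, d = max(
            (*longest_zero_run(vec, used), d) for d, vec in enumerate(diagonals)
        )
        if length == 0:
            break
        picks.append((d, start, length))
        for i in range(start, start + length):
            used[i] = True
    return picks

# Two toy diagonal bit-vectors; '0' marks a match.
print(magnet_greedy(["000100000", "111000111"], num_segments=2))
# → [(0, 4, 5), (0, 0, 3)]
```

The proposed fix sketched in the text amounts to adding a penalty inside the `max(...)` selection whenever a candidate run's diagonal differs from that of its neighboring, already-picked segments.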
9.4 Effects of Hardware Pre-Alignment Filtering on Read Alignment

We first analyze the execution time of our hardware pre-alignment filters, GateKeeper, MAGNET, Shouji, and SneakySnake. We build the FPGA implementation of SneakySnake using a sub-matrix width of 8 columns (t = 8) and include 3 replications in the design. We use GateKeeper as an optimized and efficient hardware implementation of SHD [65]. We evaluate our four pre-alignment filters using a single FPGA chip. We use 120 million sequence pairs, each of which is 100 bp long, from set 1, set 2, set 3, and set 4. We summarize the execution time of the CPU implementations along with that of their hardware accelerators in Table 9.4. We make two key observations based on Table 9.4.

1. Our hardware accelerators provide two to three orders of magnitude (322x to 7,250x) speedup over their CPU implementations. GateKeeper provides up to two orders of magnitude of acceleration over SHD.

2. The execution times of the hardware implementations of SneakySnake and Shouji are as low as that of GateKeeper and 2x-8x lower than that of the MAGNET pre-alignment filter. This observation is in accord with our expectation and can be explained by the fact that MAGNET has more computational overhead, which limits the number of filtering units. Yet SneakySnake is four and five orders of magnitude more accurate than Shouji and GateKeeper, respectively (as we show earlier).

We conclude that our hardware accelerators provide two to three orders of magnitude of speedup over their CPU implementations. Additionally, the execution time of a hardware accelerator is proportional to its FPGA resource utilization (the lower the resource utilization, the lower the execution time).

Next, we analyze the benefits of integrating our hardware pre-alignment filters with the state-of-the-art aligners.
GateKeeper, MAGNET, Shouji, and SneakySnake are standalone pre-alignment filters and can be integrated with any existing alignment algorithm.

Table 9.4: The execution time (in seconds) of GateKeeper, MAGNET, Shouji, and SneakySnake under different edit distance thresholds. We use set 1 to set 4 with a read length of 100 bp. We provide the performance results for the CPU implementations (SHD, MAGNET-CPU, Shouji-CPU, and SneakySnake-5) and the hardware accelerators (GateKeeper, MAGNET, Shouji, and SneakySnake-FPGA) with the maximum number of filtering units.

In Table 9.5, we present the effects of our four hardware pre-alignment filters on the overall alignment execution time. We use a sub-matrix width of 8 columns (t = 8) and include 3 replications in the design of the hardware architecture of the SneakySnake algorithm. We also compare the effect of our pre-alignment filters with that of SHD [65]. We select the five best performing aligners, each of which is designed for a different type of computing platform. While the Edlib [78] algorithm is implemented in C for standard CPUs, Parasail [54] exploits SIMD-capable CPUs. GSWABE [57] is designed for GPUs. CUDASW++ 3.0 [59] exploits the SIMD capability of both CPUs and GPUs. FPGASW [53] exploits the very large number of hardware execution units offered by the same FPGA chip (i.e., VC709) as our accelerator. We evaluate the execution time of Edlib [78] and Parasail [54] on our machine. However, FPGASW [53], CUDASW++ 3.0 [59], and GSWABE [57] are not open-source and not available to us. Therefore, we scale the reported number of computed entries of the dynamic programming matrix per second. We use a total of 120 million real sequence pairs from our previously described four datasets (set 1, set 2, set 3, and set 4) in this analysis. We make three key observations.

1.
The execution time of Edlib [78] is reduced by up to 21.4x, 18.8x, 16.5x, 13.9x, and 5.2x after the addition of SneakySnake, Shouji, MAGNET, GateKeeper, and SHD, respectively, as a pre-alignment filtering step. We observe nearly the same trend for Parasail [54] combined with each of the four pre-alignment filters.

2. Aligners designed for FPGAs and GPUs follow a different trend than the one we observe for the CPU aligners. We observe that the ability of SHD [65] to reduce the alignment time of GSWABE [57], CUDASW++ 3.0 [59], and FPGASW [53] diminishes. SHD even provides unsatisfactory performance, as it increases the execution time of the aligner instead of reducing it. This is due to the fact that SHD is 6x slower than CUDASW++ 3.0 [59] and FPGASW [53] and slightly slower than GSWABE [57].

3. SneakySnake, Shouji, MAGNET, and GateKeeper still contribute significantly to reducing the overall execution time of FPGA- and GPU-based aligners. SneakySnake reduces the execution time of FPGASW [53], CUDASW++ 3.0 [59], and GSWABE [57] by factors of up to 16x, 15.5x, and 20.3x, respectively. This is slightly higher (up to 1.3x) than the effect of Shouji on the execution time of these aligners. Shouji reduces the overall alignment time of FPGASW [53], CUDASW++ 3.0 [59], and GSWABE [57] by factors of up to 14.5x, 14.2x, and 17.9x, respectively. This is up to 1.5x, 1.4x, and 85x more than the effect of MAGNET, GateKeeper, and SHD, respectively, on the end-to-end alignment time.

We conclude that among the four hardware pre-alignment filters, SneakySnake (3-replication design with t = 8) is the best performing filter in terms of both speed and accuracy.

Table 9.5: End-to-end execution time (in seconds) for several state-of-the-art sequence alignment algorithms, with and without pre-alignment filters (SneakySnake, Shouji, MAGNET, GateKeeper, and SHD), across different edit distance thresholds. We use four datasets (set 1, set 2, set 3, and set 4).
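The end-to-end speedups above follow a simple cost model: total time is the filter's own time plus the aligner's time on the pairs the filter accepts, so a cheap filter with a low accept rate translates directly into a large speedup. A back-of-the-envelope sketch, where every number is an illustrative assumption rather than a measurement from Table 9.5:

```python
# Back-of-the-envelope model of end-to-end alignment time with a
# pre-alignment filter. All numbers below are illustrative assumptions.

def end_to_end_speedup(t_filter_per_pair, t_align_per_pair, accept_rate):
    """Speedup of (filter + align accepted pairs) over aligning every pair.

    accept_rate is the fraction of pairs the filter passes to the aligner
    (correct mappings plus false accepts); a lower false accept rate means
    fewer wasted alignments and thus a higher speedup."""
    baseline = t_align_per_pair
    filtered = t_filter_per_pair + accept_rate * t_align_per_pair
    return baseline / filtered

# Example: filtering is 100x cheaper than alignment and passes 5% of pairs.
print(round(end_to_end_speedup(0.01, 1.0, 0.05), 1))  # 16.7
```

This also explains observation 2: when the aligner itself is accelerated (GPU or FPGA), `t_align_per_pair` shrinks, so a slow filter such as SHD can dominate the total and even hurt end-to-end performance.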
Integrating SneakySnake with an aligner does not lead to negative effects. We also conclude that the concept of pre-alignment filtering is still effective in boosting the overall performance of the alignment step, even when the dynamic programming algorithm is accelerated by state-of-the-art hardware accelerators such as SIMD-capable CPUs, FPGAs, and GPUs.

Now we analyze the benefits of integrating our CPU implementations of SneakySnake and the prefix edit distance with the state-of-the-art CPU aligners, Edlib [78] and Parasail [54]. We evaluate the end-to-end execution time of Edlib (referred to as path in [78] and configured as a banded global Levenshtein distance with CIGAR-enabled output) with and without the pre-alignment filtering step for read lengths of 100 bp (Figure 9.8), 150 bp (Figure 9.9), and 250 bp (Figure 9.10). We make three key observations.

1. SneakySnake-5 combined with the banded Edlib is up to 20.4x, 33x, and 43x faster than Edlib without a pre-alignment filter for E < … bp (m = 100 bp), E < 12 bp (m = 150 bp), and E < 15 bp (m = 250 bp), where E is the edit distance threshold and m is the read length.

2. Edlib-50 combined with the banded Edlib is slower than SneakySnake-5 combined with the same aligner for low-edit datasets (Figure 9.8a-b, Figure 9.9a-b, and Figure 9.10a-b). Edlib-50 becomes more effective than SneakySnake-5 in reducing the execution time of Edlib across edit-rich datasets for E > … bp (m = 100 bp and 150 bp) and E > … bp (m = 250 bp).

3.
SHD leads to a negative effect, as it slows down the alignment speed of Edlib for E > … bp (m = 100 bp) and E > … bp (m = 150 bp and 250 bp).

Secondly, we evaluate the effects of adding these pre-alignment filters to Parasail [54] for read lengths of 100 bp (Figure 9.11), 150 bp (Figure 9.12), and 250 bp (Figure 9.13). We configure Parasail as NW banded and with CIGAR-enabled output. We make three key observations.

1. SneakySnake-5 still provides the highest end-to-end speedup over all other pre-alignment filters when combined with Parasail for E < … bp (m = 100 bp), E < … bp (m = 150 bp), and E < 10 bp (m = 250 bp). SneakySnake-5 combined with Parasail yields up to 36.3x, 42x, and 57.9x speedup over Parasail without a pre-alignment filter for read lengths of 100 bp, 150 bp, and 250 bp, respectively.

2. Edlib-50 combined with Parasail becomes more effective than SneakySnake-5 combined with Parasail for E > … bp (m = 100 bp), E > … bp (m = 150 bp), and E > … bp (m = 250 bp).

Figure 9.8: End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters.
We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm.

Figure 9.9: End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds.
We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm.

Figure 9.10: End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds.
We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm.

Figure 9.11: End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds.

3. SHD slows down the alignment speed of Parasail for E > … bp (m = 100 bp) and E > … bp (m = 150 bp and 250 bp).

We conclude that our SneakySnake algorithm is the best performing CPU pre-alignment filter in terms of both speed and accuracy. It accelerates the state-of-the-art read alignment algorithms by up to an order of magnitude. We demonstrate that the SneakySnake algorithm does not lead to negative effects for edit distance thresholds of 0% to 10% of the read length.
We also want to emphasize that combining our SneakySnake pre-alignment filter (i.e., SneakySnake-5) with a prefix edit distance algorithm (i.e., Edlib-50) provides the fastest and most accurate pre-alignment filtering across a wide range of edit distance thresholds and read lengths.

Figure 9.12: End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds.

In this section, we evaluate the space-efficiency benefits of integrating our SneakySnake algorithm with the state-of-the-art full read aligner, Edlib [78]. We use the Valgrind massif tool to examine the memory utilization. We provide the memory footprint of Edlib in Figure 9.14. On average, Edlib shows a memory footprint of 150 KB. We then evaluate the memory utilization of integrating Edlib (in edit distance mode and without backtracking) with Edlib (path) in Figure 9.15.
The addition of the exact edit distance algorithm to the read alignment shows a slight reduction (∼…) in memory utilization.

Figure 9.13: End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds.

This observation is in accord with our expectation and can be explained by the fact that SneakySnake-5 requires a space as small as a sub-matrix of size 5 x (2E+1) and at most m x (2E+1), whereas Edlib always requires a space of m x m. We conclude that the SneakySnake algorithm is fast, accurate, and yet memory-efficient.

Figure 9.14: Memory utilization of the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25.

Figure 9.15: Memory utilization of the exact edit distance algorithm (Edlib ED) combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25.
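The space argument above can be made concrete by counting the dynamic programming entries each approach keeps live. The helper names below are our own illustration; the formulas are the ones stated in the text.

```python
# Sketch comparing working-set sizes: SneakySnake-5 keeps a sub-matrix of
# roughly 5 x (2E+1) entries (at most m x (2E+1)), while a full
# dynamic-programming aligner keeps m x m entries.

def sneakysnake_cells(E, partition=5):
    """Entries needed by SneakySnake with a partition of 5 rows."""
    return partition * (2 * E + 1)

def full_dp_cells(m):
    """Entries needed by a full m x m dynamic programming matrix."""
    return m * m

# Parameters of set 12: read length 250 bp, edit distance threshold 25.
E, m = 25, 250
print(sneakysnake_cells(E))  # 255
print(full_dp_cells(m))      # 62500
```

Even in this worst case for SneakySnake-5 (the largest threshold we evaluate), the working set is more than two orders of magnitude smaller than the full matrix, which is consistent with the memory utilization gap shown in Figures 9.14-9.16.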
Figure 9.16: Memory utilization of SneakySnake-5 combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25.

Chapter 10

Conclusions and Future Directions

Our goal in this thesis is to considerably minimize the time spent on calculating the optimal alignment in genome analysis, given limited computational resources (i.e., a personal computer or small hardware). To this end, we first provide a comprehensive accuracy analysis of pre-alignment filtering. Understanding the causes of filtering inaccuracy helps us to design new fast and accurate pre-alignment filters. Second, we propose the first hardware accelerator architecture for pre-alignment in genome read mapping. We leverage the large number of filtering units that our hardware accelerator offers for accelerating our proposed hardware-aware algorithms. We propose four hardware pre-alignment filters: GateKeeper, Shouji, MAGNET, and SneakySnake. In our experimental evaluation, our hardware pre-alignment filters show, on average, three orders of magnitude speedup over their equivalent CPU implementations. We demonstrate that GateKeeper occupies the smallest share of the FPGA resources but is the least accurate filter. We show that MAGNET provides a low false accept rate but incurs a very low rate of falsely rejected mappings. We also demonstrate that Shouji is more accurate than GateKeeper and faster than MAGNET. However, SneakySnake is our best performing pre-alignment filter in terms of both speed and accuracy. SneakySnake has a very low false accept rate and a 0% false reject rate. We demonstrate that SneakySnake reduces the execution time of existing read aligners by up to an order of magnitude.

Third, we introduce a fast and cost-effective CPU implementation of our best performing pre-alignment filter, SneakySnake.
In our comprehensive evaluation, we demonstrate that the CPU implementation of SneakySnake reduces the execution time of the best performing read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. We also experimentally demonstrate that SneakySnake has a 50% smaller memory footprint than Edlib. The CPU implementation of SneakySnake obviates the need for costly hardware and high hardware design effort by providing a fast and cost-effective implementation.

We demonstrate that the concept of pre-alignment filtering provides substantial benefits to existing and future read alignment algorithms. New accelerated sequence aligners are frequently introduced that offer different strengths and features. Our proposed pre-alignment filters offer the ability to accelerate existing aligners by an order of magnitude without sacrificing any of their capabilities and features. As such, we hope that this thesis catalyzes the adoption of our proposed pre-alignment filters in genome sequence analysis, as they are becoming increasingly necessary to cope with the processing requirements of greatly increasing amounts of genomic data.

This thesis opens up several avenues for future research. In this section, we describe five directions based on the ideas and approaches proposed in this thesis. These ideas can lead to a new read mapper or improve existing ones, which we will explore for various mappers in our future research.

1. The first potential target of our research is to influence the design of more intelligent and attractive sequencing machines by integrating SneakySnake or Shouji inside them, to perform real-time pre-alignment filtering. Sequencing machines (e.g., Illumina HiSeq 2500, HiSeq 4000, and MiSeq) are equipped with FPGA chips for accelerating their internal computations. Integrating our pre-alignment filters with the sequencing machine has two benefits.
First, it allows a significant reduction in the total execution time of genome analysis by starting read mapping while sequencing is still in progress [144]. Second, it can hide the complexity and details of the underlying hardware from users who are not necessarily fluent in FPGAs.

2. Cloud-enabled pre-alignment filtering in a pay-per-use fashion is also a promising solution to encourage the use of such pre-alignment filters and address the concern of lack of experience in dealing with hardware designs. Cloud computing offers access to a large number of advanced FPGA chips that can be used concurrently via a simple user-friendly interface. However, such a scenario requires the development of privacy-preserving pre-alignment filters due to privacy and legal concerns related to sharing sensitive genome data with a third party [145, 146]. Our next efforts will focus on exploring privacy-preserving real-time pre-alignment filtering.

3. Since a single-core SneakySnake/GateKeeper/Shouji has only a small footprint on the FPGA, we can combine our architecture with any of the FPGA-based accelerators for BWT-FM or hash-based mapping techniques on a single FPGA chip. With such a combination, the end result would be an efficient and fast multi-layer mapping system: alignments that pass our pre-alignment filter can be further verified using a dynamic-programming-based alignment algorithm within the same chip.

4. To further improve the performance of our SneakySnake pre-alignment filter, we can take full advantage of the redundancy across both the reference genome and the reads present in large sequencing data sets, and store the optimal path for small subsequences of the read and the reference sequences. We can later use the pre-computed paths to calculate the approximate edit distance quickly. This requires understanding the trade-offs between the memory footprint of seed filtering and the speed of pre-alignment filtering.
Another approach for accelerating our SneakySnake algorithm is to exploit GPUs or multi-threaded processors to achieve a highly parallel implementation.

5. We also aim to explore the possibility of performing the full alignment step in linear time complexity in terms of read length. We believe that we can achieve this challenging goal by improving the accuracy of the DFS algorithm in SneakySnake. If we penalize each potential path based on its distance from the main diagonal, we can eventually infer the exact number of edits along the entire path. This allows us not only to calculate the edit distance accurately, but also to provide the optimal alignment on the fly, without performing "backtracking", by printing the snake's path. The path contains all the needed information, such as the location, the number, and the type of the edits involved.

Chapter 11

Other Works of This Author

During the course of my doctoral studies, I had the opportunity to work closely and collaborate with my fellow graduate students and researchers from many other institutions. These projects helped me to 1) learn more about genome analysis, metagenomics, and bioinformatics in general, 2) develop my own ideas and skills to conduct successful research, and 3) widen my network of good mentors and researchers. In this chapter, I would like to acknowledge my collaborators and the projects related to this dissertation.

In a collaboration with Eleazar Eskin, Serghei Mangul, David Koslicki, and Nathan LaPierre, we worked on developing new methods for accurate and comprehensive microbial community profiling. I also worked on developing an interactive visualization tool for facilitating the analysis of the profiling results, which leaves the exploration and interpretation to the user. This gave me the opportunity to extend my knowledge about metagenomics and gain more insight into the need for accelerating metagenomic profiling tools.
This work is called MiCoP and is published in BMC Genomics [40] and in bioRxiv [147]. I am very happy to have started this collaboration and to continue working together on other related projects. During the last year of my doctoral studies, we started another two new projects: 1) improving the speed and accuracy of our metagenomics profiler using a new seed filtering algorithm and a new read mapper, and 2) providing a survey of algorithmic foundations and methodologies across read mapping methods for both short and long reads. We discuss how longer read lengths produce unique advantages and limitations for read mapping techniques. We also discuss how general mapping algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies. This work is the first of its kind to provide a clear roadmap of how technology dictates read mapping algorithms.

In a collaboration with Onur Mutlu, Jeremie Kim, and Damla Senol Cali, we developed GRIM-Filter, a pre-alignment filter for reducing the number of invalid mappings in genome analysis. It exploits the high memory bandwidth and the logic layer of 3D-stacked memory to perform highly-parallel filtering inside the 3D-stacked memory itself. This gave me the opportunity to learn more about the 3D-stacked memory architecture, which has a lot in common with other memory architectures. This work is published in BMC Genomics [25] and available on arXiv [109].

In a collaboration with Erman Ayday, Nour Almadhoun, and Azita Nouri, we surveyed a wide spectrum of cross-layer privacy breaching strategies against human genomic data (using both public genomic databases and other public non-genomic data). We outline the principles and outcomes of each technique, and assess its technological complexity and maturation. We then review potential privacy-preserving countermeasure mechanisms for each threat.
This work introduced me to the importance of developing privacy-preserving tools for genome analysis applications. This work is presented at DPM [146] and PRIVAGEN [148].

All of these works are closely related to this dissertation. During these collaborations, we all gained valuable expertise on improving genome analysis by learning from each other.

Bibliography

[1] J. M. Lane, J. Liang, I. Vlasac, S. G. Anderson, D. A. Bechtold, J. Bowden, R. Emsley, S. Gill, M. A. Little, A. I. Luik, et al., "Genome-wide association analyses of sleep disturbance traits identify new loci and highlight shared genetics with neuropsychiatric and metabolic traits," Nature Genetics, vol. 49, no. 2, p. 274, 2017.

[2] G. Dobigny, J. Britton-Davidian, and T. J. Robinson, "Chromosomal polymorphism in mammals: an evolutionary perspective," Biological Reviews, vol. 92, no. 1, pp. 1-21, 2017.

[3] P. H. Sudmant, T. Rausch, E. J. Gardner, R. E. Handsaker, A. Abyzov, J. Huddleston, Y. Zhang, K. Ye, G. Jun, M. H.-Y. Fritz, et al., "An integrated map of structural variation in 2,504 human genomes," Nature, vol. 526, no. 7571, p. 75, 2015.

[4] A. Korte and A. Farlow, "The advantages and limitations of trait analysis with GWAS: a review," Plant Methods, vol. 9, no. 1, p. 29, 2013.

[5] P. M. Visscher, M. A. Brown, M. I. McCarthy, and J. Yang, "Five years of GWAS discovery," The American Journal of Human Genetics, vol. 90, no. 1, pp. 7-24, 2012.

[6] F. Antonacci, J. M. Kidd, T. Marques-Bonet, M. Ventura, P. Siswara, Z. Jiang, and E. E. Eichler, "Characterization of six human disease-associated inversion polymorphisms," Human Molecular Genetics, vol. 18, no. 14, pp. 2555-2566, 2009.

[7] P. K. Han, K. L. Umstead, B. A. Bernhardt, R. C. Green, S. Joffe, B. Koenig, I. Krantz, L. B. Waterston, L. G. Biesecker, and B. B. Biesecker, "A taxonomy of medical uncertainties in clinical genome sequencing," Genetics in Medicine, vol. 19, no. 8, p. 918, 2017.

[8] P. M. Visscher, N. R. Wray, Q. Zhang, P. Sklar, M. I. McCarthy, M. A. Brown, and J. Yang, "10 years of GWAS discovery: biology, function, and translation," The American Journal of Human Genetics, vol. 101, no. 1, pp. 5-22, 2017.

[9] L. Chin, J. N. Andersen, and P. A. Futreal, "Cancer genomics: from discovery science to personalized medicine," Nature Medicine, vol. 17, no. 3, p. 297, 2011.

[10] L. Ding, M. C. Wendl, D. C. Koboldt, and E. R. Mardis, "Analysis of next-generation genomic data in cancer: accomplishments and challenges," Human Molecular Genetics, vol. 19, no. R2, pp. R188-R196, 2010.

[11] D. E. Pritchard, F. Moeckel, M. S. Villa, L. T. Housman, C. A. McCarty, and H. L. McLeod, "Strategies for integrating personalized medicine into healthcare practice," Personalized Medicine, vol. 14, no. 2, pp. 141-152, 2017.

[12] E. D. Esplin, L. Oei, and M. P. Snyder, "Personalized sequencing and the future of medicine: discovery, diagnosis and defeat of disease," Pharmacogenomics, vol. 15, no. 14, pp. 1771-1790, 2014.

[13] M. L. Metzker, "Sequencing technologies—the next generation," Nature Reviews Genetics, vol. 11, no. 1, p. 31, 2010.

[14] M. A. Hamburg and F. S. Collins, "The path to personalized medicine," New England Journal of Medicine, vol. 363, no. 4, pp. 301-304, 2010.

[15] G. S. Ginsburg and J. J. McCarthy, "Personalized medicine: revolutionizing drug discovery and patient care," TRENDS in Biotechnology, vol. 19, no. 12, pp. 491-496, 2001.

[16] J. A. Reuter, D. V. Spacek, and M. P. Snyder, "High-throughput sequencing technologies," Molecular Cell, vol. 58, no. 4, pp. 586-597, 2015.

[17] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, vol. 147, pp. 195-197, 1981.

[18] H. Cheng, H. Jiang, J. Yang, Y. Xu, and Y. Shang, "BitMapper: an efficient all-mapper based on bit-vector computing," BMC Bioinformatics, vol. 16, no. 1, p. 192, 2015.

[19] H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, and C. Alkan, "Accelerating read mapping with FastHASH," BMC Genomics, vol. 14, no. 1, p. S13, 2013.

[20] F. Hach, F. Hormozdiari, C. Alkan, F. Hormozdiari, I. Birol, E. E. Eichler, and S. C. Sahinalp, "mrsFAST: a cache-oblivious algorithm for short-read mapping," Nature Methods, vol. 7, no. 8, p. 576, 2010.

[21] H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM," arXiv preprint arXiv:1303.3997, 2013.

[22] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, p. 357, 2012.

[23] R. Luo, T. Wong, J. Zhu, C.-M. Liu, X. Zhu, E. Wu, L.-K. Lee, H. Lin, W. Zhu, D. W. Cheung, et al., "SOAP3-dp: fast, accurate and sensitive GPU-based short read aligner," PLoS ONE, vol. 8, no. 5, p. e65632, 2013.

[24] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys (CSUR), vol. 33, no. 1, pp. 31-88, 2001.

[25] J. S. Kim, D. S. Cali, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, and O. Mutlu, "GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies," BMC Genomics, vol. 19, no. 2, p. 89, 2018.

[26] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, p. 1185, 2012.

[27] C. Firtina, J. S. Kim, M. Alser, D. S. Cali, A. E. Cicek, C. Alkan, and O. Mutlu, "Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm," arXiv preprint arXiv:1902.04341, 2019.

[28] E. J. Fox, K. S. Reid-Bayliss, M. J. Emond, and L. A. Loeb, "Accuracy of next generation sequencing platforms," Next Generation, Sequencing & Applications, vol. 1, 2014.

[29] K. J. McKernan, H. E. Peckham, G. L. Costa, S. F. McLaughlin, Y. Fu, E. F. Tsung, C. R. Clouser, C. Duncan, J. K. Ichikawa, C. C. Lee, et al., "Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding," Genome Research, vol. 19, no. 9, pp.
1527–1541, 2009.[30] C. Calude, K. Salomaa, and S. Yu, “Additive distances and quasi-distancesbetween words,” Journal of Universal Computer Science , vol. 8, no. 2,pp. 141–152, 2002.[31] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions,and reversals,” in Soviet physics doklady , vol. 10, pp. 707–710, 1966.[32] S. B. Needleman and C. D. Wunsch, “A general method applicable to thesearch for similarities in the amino acid sequence of two proteins,” Journalof molecular biology , vol. 48, no. 3, pp. 443–453, 1970.[33] S. Canzar and S. L. Salzberg, “Short read mapping: an algorithmic tour,” Proceedings of the IEEE , vol. 105, no. 3, pp. 436–458, 2017.[34] M. Escalona, S. Rocha, and D. Posada, “A comparison of tools for thesimulation of genomic next-generation sequencing data,” Nature ReviewsGenetics , vol. 17, no. 8, p. 459, 2016.[35] M. A. Eberle, E. Fritzilas, P. Krusche, M. K¨allberg, B. L. Moore, M. A.Bekritsky, Z. Iqbal, H.-Y. Chuang, S. J. Humphray, A. L. Halpern, et al. ,“A reference data set of 5.4 million phased human variants validated by ge-netic inheritance from sequencing a three-generation 17-member pedigree,” Genome research , vol. 27, no. 1, pp. 157–164, 2017.12736] G. X. Zheng, B. T. Lau, M. Schnall-Levin, M. Jarosz, J. M. Bell, C. M.Hindson, S. Kyriazopoulou-Panagiotopoulou, D. A. Masquelier, L. Merrill,J. M. Terry, et al. , “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing,” Nature biotechnology , vol. 34, no. 3,p. 303, 2016.[37] M. Griffith, C. A. Miller, O. L. Griffith, K. Krysiak, Z. L. Skidmore,A. Ramu, J. R. Walker, H. X. Dang, L. Trani, D. E. Larson, et al. , “Opti-mizing cancer genome sequencing and analysis,” Cell systems , vol. 1, no. 3,pp. 210–223, 2015.[38] I. Iossifov, B. J. O’roak, S. J. Sanders, M. Ronemus, N. Krumm, D. Levy,H. A. Stessman, K. T. Witherspoon, L. Vives, K. E. Patterson, et al. 
, “Thecontribution of de novo coding mutations to autism spectrum disorder,” Nature , vol. 515, no. 7526, p. 216, 2014.[39] M. Meyerson, S. Gabriel, and G. Getz, “Advances in understanding cancergenomes through second-generation sequencing,” Nature Reviews Genetics ,vol. 11, no. 10, p. 685, 2010.[40] N. LaPierre, S. Mangul, M. Alser, I. Mandric, N. C. Wu, D. Koslicki, andE. Eskin, “MiCoP: microbial community profiling method for detecting vi-ral and fungal organisms in metagenomic samples,” BMC genomics , vol. 20,no. 5, p. 423, 2019.[41] J. Qin, Y. Li, Z. Cai, S. Li, J. Zhu, F. Zhang, S. Liang, W. Zhang, Y. Guan,D. Shen, et al. , “A metagenome-wide association study of gut microbiotain type 2 diabetes,” Nature , vol. 490, no. 7418, p. 55, 2012.[42] N. Segata, L. Waldron, A. Ballarini, V. Narasimhan, O. Jousson, andC. Huttenhower, “Metagenomic microbial community profiling using uniqueclade-specific marker genes,” Nature methods , vol. 9, no. 8, p. 811, 2012.[43] E. Afshinnekoo, C. Meydan, S. Chowdhury, D. Jaroudi, C. Boyer, N. Bern-stein, J. M. Maritz, D. Reeves, J. Gandara, S. Chhangawala, et al. ,“Geospatial resolution of human and bacterial diversity with city-scalemetagenomics,” Cell systems , vol. 1, no. 1, pp. 72–87, 2015.12844] S. K. Delaney, M. L. Hultner, H. J. Jacob, D. H. Ledbetter, J. J. McCarthy,M. Ball, K. B. Beckman, J. W. Belmont, C. S. Bloss, M. F. Christman, et al. , “Toward clinical genomics in everyday medicine: perspectives andrecommendations,” Expert review of molecular diagnostics , vol. 16, no. 5,pp. 521–532, 2016.[45] L. P´erez-Lago, M. Mart´ınez-Lirola, S. Garc´ıa, M. Herranz, I. Mokrousov,I. Comas, L. Mart´ınez-Priego, E. Bouza, and D. Garc´ıa-de Viedma, “UrgentImplementation in a Hospital Setting of a Strategy To Rule Out SecondaryCases Caused by Imported Extensively Drug-Resistant Mycobacterium tu-berculosis Strains at Diagnosis,” Journal of clinical microbiology , vol. 54,no. 12, pp. 2969–2974, 2016.[46] S. F. Kingsmore, J. 
Petrikin, L. K. Willig, and E. Guest, “Emergency med-ical genomes: a breakthrough application of precision medicine,” Genomemedicine , vol. 7, no. 1, p. 82, 2015.[47] L. K. Willig, J. E. Petrikin, L. D. Smith, C. J. Saunders, I. Thiffault, N. A.Miller, S. E. Soden, J. A. Cakici, S. M. Herd, G. Twist, et al. , “Whole-genome sequencing for identification of Mendelian disorders in critically illinfants: a retrospective analysis of diagnostic and clinical findings,” TheLancet Respiratory Medicine , vol. 3, no. 5, pp. 377–387, 2015.[48] N. M. Sweeney, S. A. Nahas, S. Chowdhury, M. Del Campo, M. C. Jones,D. P. Dimmock, S. F. Kingsmore, R. Investigators, et al. , “The case forearly use of rapid whole genome sequencing in management of criticallyill infants: Late diagnosis of Coffin-Siris syndrome in an infant with leftcongenital diaphragmatic hernia, congenital heart disease and recurrentinfections,” Molecular Case Studies , pp. mcs–a002469, 2018.[49] L. Farnaes, A. Hildreth, N. M. Sweeney, M. M. Clark, S. Chowdhury, S. Na-has, J. A. Cakici, W. Benson, R. H. Kaplan, R. Kronick, et al. , “Rapidwhole-genome sequencing decreases infant morbidity and cost of hospital-ization,” NPJ genomic medicine , vol. 3, no. 1, p. 10, 2018.[50] J. S. Berg, L. M. Amendola, C. Eng, E. Van Allen, S. W. Gray, N. Wagle,H. L. Rehm, E. T. DeChene, M. C. Dulik, F. M. Hisama, et al. , “Processes129nd preliminary outputs for identification of actionable genes as incidentalfindings in genomic sequence data in the Clinical Sequencing ExploratoryResearch Consortium,” Genetics in Medicine , vol. 15, no. 11, p. 860, 2013.[51] C. R. Ferreira, D. S. Regier, D. W. Hadley, P. S. Hart, and M. Muenke,“Medical genetics and genomic medicine in the United States of America.Part 1: history, demographics, legislation, and burden of disease,” Molecu-lar genetics & genomic medicine , vol. 5, no. 4, pp. 307–316, 2017.[52] S. S. Banerjee, M. El-Hadedy, J. B. Lim, Z. T. Kalbarczyk, D. Chen,S. Lumetta, and R. K. 
Iyer, “ASAP: Accelerated Short-Read Alignmenton Programmable Hardware,” arXiv preprint arXiv:1803.02657 , 2018.[53] X. Fei, Z. Dan, L. Lina, M. Xin, and Z. Chunlei, “FPGASW: Acceler-ating Large-Scale Smith–Waterman Sequence Alignment Application withBacktracking on FPGA Linear Systolic Array,” Interdisciplinary Sciences:Computational Life Sciences , vol. 10, no. 1, pp. 176–188, 2018.[54] J. Daily, “Parasail: SIMD C library for global, semi-global, and local pair-wise sequence alignments,” BMC bioinformatics , vol. 17, no. 1, p. 81, 2016.[55] E. J. Houtgast, V.-M. Sima, K. Bertels, and Z. Al-Ars, “An FPGA-basedsystolic array to accelerate the BWA-MEM genomic mapping algorithm,”in Embedded Computer Systems: Architectures, Modeling, and Simulation(SAMOS), 2015 International Conference on , pp. 221–227, IEEE, 2015.[56] E. Georganas, A. Bulu¸c, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick,“meraligner: A fully parallel sequence aligner,” in Parallel and DistributedProcessing Symposium (IPDPS), 2015 IEEE International , pp. 561–570,IEEE, 2015.[57] Y. Liu and B. Schmidt, “GSWABE: faster GPU-accelerated sequence align-ment with optimal alignment retrieval for short DNA sequences,” Concur-rency and Computation: Practice and Experience , vol. 27, no. 4, pp. 958–972, 2015. 13058] H. M. Waidyasooriya, M. Hariyama, and M. Kameyama, “FPGA-accelerator for DNA sequence alignment based on an efficient data-dependent memory access scheme,” Highly-Efficient Accelerators and Re-configurable Technologies , pp. 127–130, 2014.[59] Y. Liu, A. Wirawan, and B. Schmidt, “CUDASW++ 3.0: acceleratingSmith-Waterman protein database search by coupling CPU and GPU SIMDinstructions,” BMC bioinformatics , vol. 14, no. 1, p. 117, 2013.[60] J. Arram, K. H. Tsoi, W. Luk, and P. Jiang, “Hardware acceleration ofgenetic sequence alignment,” in International Symposium on Applied Re-configurable Computing , pp. 13–24, Springer, 2013.[61] D. Weese, M. Holtgrewe, and K. 
Reinert, “RazerS 3: faster, fully sensitiveread mapping,” Bioinformatics , vol. 28, no. 20, pp. 2592–2599, 2012.[62] A. Ahmadi, A. Behm, N. Honnalli, C. Li, L. Weng, and X. Xie, “Hobbes:optimized gram-based methods for efficient read alignment,” Nucleic acidsresearch , vol. 40, no. 6, pp. e41–e41, 2011.[63] G. Rizk and D. Lavenier, “GASSST: global alignment short sequence searchtool,” Bioinformatics , vol. 26, no. 20, pp. 2534–2540, 2010.[64] D. Weese, A.-K. Emde, T. Rausch, A. D¨oring, and K. Reinert, “Raz-erS—fast read mapping with sensitivity control,” Genome research , vol. 19,no. 9, pp. 1646–1654, 2009.[65] H. Xin, J. Greth, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, andO. Mutlu, “Shifted Hamming distance: a fast and accurate SIMD-friendlyfilter to accelerate alignment verification in read mapping,” Bioinformatics ,vol. 31, no. 10, pp. 1553–1560, 2015.[66] T. Nishimura, J. L. Bordim, Y. Ito, and K. Nakano, “Accelerating theSmith-Waterman Algorithm Using Bitwise Parallel Bulk ComputationTechnique on GPU,” in Parallel and Distributed Processing SymposiumWorkshops (IPDPSW), 2017 IEEE International , pp. 932–941, IEEE, 2017.13167] P. Chen, C. Wang, X. Li, and X. Zhou, “Accelerating the next generationlong read mapping with the FPGA-based system,” IEEE/ACM transac-tions on computational biology and bioinformatics , vol. 11, no. 5, pp. 840–852, 2014.[68] H. M. Waidyasooriya and M. Hariyama, “Hardware-acceleration of short-read alignment based on the Burrows-Wheeler transform,” IEEE Transac-tions on Parallel and Distributed Systems , vol. 27, no. 5, pp. 1358–1372,2016.[69] R. C. Edgar and K. Sj¨olander, “A comparison of scoring functions for pro-tein sequence profile alignment,” Bioinformatics , vol. 20, no. 8, pp. 1301–1308, 2004.[70] H. Xin, S. Nahar, R. Zhu, J. Emmons, G. Pekhimenko, C. Kingsford,C. Alkan, and O. Mutlu, “Optimal seed solver: optimizing seed selection inread mapping,” Bioinformatics , vol. 32, no. 11, pp. 1632–1642, 2015.[71] M. Alser, H. 
Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, “Gate-Keeper: a new hardware architecture for accelerating pre-alignment in DNAshort read mapping,” Bioinformatics , vol. 33, no. 21, pp. 3355–3363, 2017.[72] M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, “Gate-Keeper: A New Hardware Architecture for Accelerating Pre-Alignment inDNA Short Read Mapping,” arXiv preprint arXiv:1604.01789 , 2016.[73] M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, “Shouji: a fastand efficient pre-alignment filter for sequence alignment,” Bioinformatics ,2019.[74] M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, “SLIDER: Fastand Efficient Computation of Banded Sequence Alignment,” arXiv preprintarXiv:1809.07858 , 2018.[75] M. Alser, O. Mutlu, and C. Alkan, “MAGNET: Understanding and im-proving the accuracy of genome pre-alignment filtering,” Transactions onInternet Research , vol. 13, no. 2, pp. 33–42, 2017.13276] M. Alser, O. Mutlu, and C. Alkan, “MAGNET: Understanding and im-proving the accuracy of genome pre-alignment filtering,” arXiv preprintarXiv:1707.01631 , 2017.[77] M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, “Explor-ing Speed/Accuracy Trade-offs in Hardware Accelerated Pre-Alignment inGenome Analysis,” in HPCA2018 Workshop on Accelerator Architecture inComputational Biology and Bioinformatics (AACBB), , Vienna, Austria,February 24th, 2018. https://aacbb-workshop.github.io/ , 2018.[78] M. ˇSoˇsi´c and M. ˇSiki´c, “Edlib: a C/C++ library for fast, exact sequencealignment using edit distance,” Bioinformatics , vol. 33, no. 9, pp. 1394–1395, 2017.[79] P. Weiner, “Linear pattern matching algorithms,” in Switching and Au-tomata Theory, 1973. SWAT’08. IEEE Conference Record of 14th AnnualSymposium on , pp. 1–11, IEEE, 1973.[80] M. T. ¨Ozsu and P. Valduriez, Principles of distributed database systems .Springer Science & Business Media, 2011.[81] M. 
BURROWS, “A block-sorting lossless data compression algorithm,” SRC Research Report, 124 , 1994.[82] P. Ferragina and G. Manzini, “Opportunistic data structures with applica-tions,” in Foundations of Computer Science, 2000. Proceedings. 41st AnnualSymposium on , pp. 390–398, IEEE, 2000.[83] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, andM. Brudno, “SHRiMP: accurate mapping of short color-space reads,” PLoScomputational biology , vol. 5, no. 5, p. e1000386, 2009.[84] H. Li and R. Durbin, “Fast and accurate short read alignment withBurrows–Wheeler transform,” Bioinformatics , vol. 25, no. 14, pp. 1754–1760, 2009.[85] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows–Wheeler transform,” Bioinformatics , vol. 26, no. 5, pp. 589–595, 2010.13386] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast andmemory-efficient alignment of short DNA sequences to the human genome,” Genome biology , vol. 10, no. 3, p. R25, 2009.[87] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang,“SOAP2: an improved ultrafast tool for short read alignment,” Bioinfor-matics , vol. 25, no. 15, pp. 1966–1967, 2009.[88] C.-M. Liu, T. Wong, E. Wu, R. Luo, S.-M. Yiu, Y. Li, B. Wang, C. Yu,X. Chu, K. Zhao, et al. , “SOAP3: ultra-fast GPU-based parallel alignmenttool for short reads,” Bioinformatics , vol. 28, no. 6, pp. 878–879, 2012.[89] C. Firtina and C. Alkan, “On genomic repeats and reproducibility,” Bioin-formatics , vol. 32, no. 15, pp. 2243–2247, 2016.[90] I. Medina, J. T´arraga, H. Mart´ınez, S. Barrachina, M. Castillo, J. Paschall,J. Salavert-Torres, I. Blanquer-Espert, V. Hern´andez-Garc´ıa, E. Quintana-Ort´ı, et al. , “Highly sensitive and ultrafast read mapping for RNA-seq anal-ysis,” DNA Research , vol. 23, no. 2, pp. 93–100, 2016.[91] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotidealignment program,” Bioinformatics , vol. 24, no. 5, pp. 713–714, 2008.[92] H. Li, J. Ruan, and R. 
Durbin, “Mapping short DNA sequencing reads andcalling variants using mapping quality scores,” Genome research , vol. 18,no. 11, pp. 1851–1858, 2008.[93] A. D. Smith, Z. Xuan, and M. Q. Zhang, “Using quality scores and longerreads improves accuracy of Solexa read mapping,” BMC bioinformatics ,vol. 9, no. 1, p. 128, 2008.[94] H. Lin, Z. Zhang, M. Q. Zhang, B. Ma, and M. Li, “ZOOM! Zillions ofoligos mapped,” Bioinformatics , vol. 24, no. 21, pp. 2431–2437, 2008.[95] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basiclocal alignment search tool,” Journal of molecular biology , vol. 215, no. 3,pp. 403–410, 1990. 13496] N. Homer, B. Merriman, and S. F. Nelson, “BFAST: an alignment tool forlarge scale genome resequencing,” PloS one , vol. 4, no. 11, p. e7767, 2009.[97] M. David, M. Dzamba, D. Lister, L. Ilie, and M. Brudno, “SHRiMP2:sensitive yet practical short read mapping,” Bioinformatics , vol. 27, no. 7,pp. 1011–1012, 2011.[98] F. Hormozdiari, F. Hach, S. C. Sahinalp, E. E. Eichler, and C. Alkan,“Sensitive and fast mapping of di-base encoded reads,” Bioinformatics ,vol. 27, no. 14, pp. 1915–1921, 2011.[99] W.-P. Lee, M. P. Stromberg, A. Ward, C. Stewart, E. P. Garrison, and G. T.Marth, “MOSAIK: a hash-based algorithm for accurate next-generationsequencing short-read mapping,” PloS one , vol. 9, no. 3, p. e90581, 2014.[100] J. Blom, T. Jakobi, D. Doppmeier, S. Jaenicke, J. Kalinowski, J. Stoye,and A. Goesmann, “Exact and complete short-read alignment to microbialgenomes using Graphics Processing Unit programming,” Bioinformatics ,vol. 27, no. 10, pp. 1351–1358, 2011.[101] E. Ukkonen, “Algorithms for approximate string matching,” Informationand control , vol. 64, no. 1-3, pp. 100–118, 1985.[102] D. Yorukoglu, Y. W. Yu, J. Peng, and B. Berger, “Compressive mapping fornext-generation sequencing,” Nature biotechnology , vol. 34, no. 4, p. 374,2016.[103] J. Arram, K. H. Tsoi, W. Luk, and P. 
Jiang, “Reconfigurable acceleration ofshort read mapping,” in Field-Programmable Custom Computing Machines(FCCM), 2013 IEEE 21st Annual International Symposium on , pp. 210–217, IEEE, 2013.[104] C. B. Olson, M. Kim, C. Clauson, B. Kogon, C. Ebeling, S. Hauck, andW. L. Ruzzo, “Hardware acceleration of short read mapping,” in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th An-nual International Symposium on , pp. 161–168, IEEE, 2012.135105] Y. Sogabe and T. Maruyama, “FPGA acceleration of short read mappingbased on sort and parallel comparison,” in Field Programmable Logic andApplications (FPL), 2014 24th International Conference on , pp. 1–4, IEEE,2014.[106] S. M. Kie(cid:32)lbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith, “Adaptiveseeds tame genomic sequence comparison,” Genome research , vol. 21, no. 3,pp. 487–493, 2011.[107] L. Egidi and G. Manzini, “Better spaced seeds using quadratic residues,” Journal of Computer and System Sciences , vol. 79, no. 7, pp. 1144–1155,2013.[108] R. Lederman, “A random-permutations-based approach to fast read align-ment,” in BMC bioinformatics , vol. 14, p. S8, BioMed Central, 2013.[109] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin,C. Alkan, and O. Mutlu, “GRIM-Filter: fast seed filtering in read mappingusing emerging memory technologies,” arXiv preprint arXiv:1708.04329 ,2017.[110] J. Kim, D. Senol, H. Xin, D. Lee, M. Alser, H. Hassan, O. Ergin, C. Alkan,and O. Mutlu, “Genome Read In-Memory (GRIM) Filter: Fast LocationFiltering in DNA Read Mapping using Emerging Memory Technologies,”2017.[111] W. J. Masek and M. S. Paterson, “A faster algorithm computing stringedit distances,” Journal of Computer and System Sciences , vol. 20, no. 1,pp. 18–31, 1980.[112] A. Backurs and P. Indyk, “Edit Distance Cannot Be Computed inStrongly Subquadratic Time (unless SETH is false),” arXiv preprintarXiv:1412.0348v4 , 2017.[113] A. Al Kawam, S. Khatri, and A. 
Datta, “A Survey of Software and Hard-ware Approaches to Performing Read Alignment in Next Generation Se-quencing,” IEEE/ACM Transactions on Computational Biology and Bioin-formatics (TCBB) , vol. 14, no. 6, pp. 1202–1213, 2017.136114] S. Aluru and N. Jammula, “A review of hardware acceleration for compu-tational genomics,” IEEE Design & Test , vol. 31, no. 1, pp. 19–30, 2014.[115] H.-C. Ng, S. Liu, and W. Luk, “Reconfigurable acceleration of geneticsequence alignment: A survey of two decades of efforts,” in Field Pro-grammable Logic and Applications (FPL), 2017 27th International Confer-ence on , pp. 1–8, IEEE, 2017.[116] E. F. D. O. Sandes, A. Boukerche, and A. C. M. A. D. Melo, “Paralleloptimal pairwise biological sequence comparison: Algorithms, platforms,and classification,” ACM Computing Surveys (CSUR) , vol. 48, no. 4, p. 63,2016.[117] H. Kung, “Why Systolic Architectures?,” Computer , vol. 15, no. 1, pp. 37–46, 1982.[118] Y.-T. Chen, J. Cong, Z. Fang, J. Lei, and P. Wei, “When apache sparkmeets FPGAs: a case study for next-generation DNA sequencing accel-eration,” in Proceedings of the 8th USENIX Conference on Hot Topics inCloud Computing , pp. 64–70, USENIX Association, 2016.[119] Y.-T. Chen and J. Cong, “CS-BWAMEM: A fast and scalable read alignerat the cloud scale for whole genome sequencing,” in Proceedings of HighThroughput Sequencing Algorithms and Applications (HITSEQ) , 2015.[120] C. Wang, R.-X. Yan, X.-F. Wang, J.-N. Si, and Z. Zhang, “Comparisonof linear gap penalties and profile-based variable gap penalties in profile–profile alignments,” Computational biology and chemistry , vol. 35, no. 5,pp. 308–318, 2011.[121] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices fromprotein blocks,” Proceedings of the National Academy of Sciences , vol. 89,no. 22, pp. 10915–10919, 1992.[122] A. Hatem, D. Bozda˘g, A. E. Toland, and ¨U. V. C¸ ataly¨urek, “Benchmarkingshort sequence mapping tools,” BMC bioinformatics , vol. 14, no. 1, p. 184,2013. 
[123] M. C. Herbordt, T. VanCourt, Y. Gu, B. Sukhwani, A. Conti, J. Model, and D. DiSabello, "Achieving high performance with FPGA-based computing," Computer, vol. 40, no. 3, 2007.
[124] S. M. Trimberger, "Three ages of FPGAs: A retrospective on the first thirty years of FPGA technology," Proceedings of the IEEE, vol. 103, no. 3, pp. 318–331, 2015.
[125] N. McVicar, A. Hoshino, A. La Torre, T. A. Reh, W. L. Ruzzo, and S. Hauck, "FPGA Acceleration of Short Read Alignment," arXiv preprint arXiv:1805.00106, 2018.
[126] N. Alachiotis, D. Theodoropoulos, and D. Pnevmatikatos, "Versatile deployment of FPGA accelerators in disaggregated data centers: A bioinformatics case study," in Field Programmable Logic and Applications (FPL), 2017 27th International Conference on, pp. 1–4, IEEE, 2017.
[127] M. Assaad and M. Alser, "An FPGA-based design and implementation of an all-digital serializer for inter module communication in SoC," IEICE Electronics Express, vol. 8, no. 23, pp. 2017–2023, 2011.
[128] M. H. Alser, Accelerating the understanding of life's code through better algorithms and hardware design. PhD thesis, Bilkent University, 2018.
[129] M. H. Alser, M. M. Assaad, and F. A. Hussin, "A wide-range programmable frequency synthesizer based on a finite state machine filter," International Journal of Electronics, vol. 100, no. 11, pp. 1546–1556, 2013.
[130] M. Assaad and M. H. Alser, "Design of an all-digital synchronized frequency multiplier based on a dual-loop (D/FLL) architecture," VLSI Design, vol. 2012, p. 17, 2012.
[131] M. H. Alser, M. Assaad, F. A. Hussin, and I. Yohannes, "Design and FPGA implementation of PLL-based quarter-rate clock and data recovery circuit," vol. 2, pp. 825–830, IEEE, 2012.
[132] M. H. Alser and M. M. Assaad, "Design and modeling of low-power clockless serial link for data communication systems," pp. 1–5, IEEE, 2011.
[133] X. Virtex, "FPGA VC709 connectivity kit," Xilinx: All Programmable, 7.
[134] M. Jacobsen, D. Richmond, M. Hogains, and R. Kastner, "RIFFA 2.1: A reusable integration framework for FPGA accelerators," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 8, no. 4, p. 22, 2015.
[135] U. Guide, "Series FPGAs configurable logic block," Xilinx, San Jose, CA, vol. 1, 7.
[136] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, "Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 273–287, ACM, 2017.
[137] D. J. Lipman and W. R. Pearson, "Rapid and sensitive protein similarity searches," Science, vol. 227, no. 4693, pp. 1435–1441, 1985.
[138] J. W. J. Williams, "Algorithm 232: heapsort," Communications of the ACM, vol. 7, no. 6, pp. 347–348, 1964.
[139] M. McNamara et al., "IEEE Standard Verilog Hardware Description Language," The Institute of Electrical and Electronics Engineers, Inc., IEEE Std 1364-2001, 2001.
[140] R. Tarjan, "Depth-first search and linear graph algorithms," SIAM Journal on Computing, vol. 1, no. 2, pp. 146–160, 1972.
[141] G. Dimitrakopoulos, K. Galanopoulos, C. Mavrokefalidis, and D. Nikolos, "Low-power leading-zero counting and anticipation logic for high-speed floating point units," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 7, pp. 837–850, 2008.
[142] C. Alkan, J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hormozdiari, J. O. Kitzman, C. Baker, M. Malig, O. Mutlu, et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, p. 1061, 2009.
[143] The 1000 Genomes Project Consortium et al., "An integrated map of genetic variation from 1,092 human genomes," Nature, vol. 491, no. 7422, p. 56, 2012.
[144] M. S. Lindner, B. Strauch, J. M. Schulze, S. H. Tausch, P. W. Dabrowski, A. Nitsche, and B. Y. Renard, "HiLive: real-time mapping of Illumina reads while sequencing," Bioinformatics, vol. 33, no. 6, pp. 917–919, 2016.
[145] S. Salinas and P. Li, "Secure Cloud Computing for Pairwise Sequence Alignment," in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 178–183, ACM, 2017.
[146] M. Alser, N. Almadhoun, A. Nouri, C. Alkan, and E. Ayday, "Can you Really Anonymize the Donors of Genomic Data in Today's Digital World?," in Data Privacy Management, and Security Assurance, pp. 237–244, Springer, 2015.
[147] N. LaPierre, S. Mangul, M. Alser, I. Mandric, N. C. Wu, D. Koslicki, and E. Eskin, "MiCoP: Microbial Community Profiling method for detecting viral and fungal organisms in metagenomic samples," bioRxiv, p. 243188, 2018.
[148] M. Alser, N. Almadhoun, A. Nouri, C. Alkan, and E. Ayday, "Identifying Anonymous Donors of Genetic Information," 2015.

Appendix A

Data

In this chapter, we use Edlib [78] (edit distance mode) to assess the number of accepted (i.e., having edits less than or equal to the edit distance threshold) and rejected (i.e., having more edits than the edit distance threshold) pairs for each of the 12 datasets. We also provide the time, in seconds, that Edlib needs to classify all of the pairs. We provide these details for set 1, set 2, set 3, and set 4 in Table A.1. We also provide the same details for set 5, set 6, set 7, and set 8 in Table A.2, and for set 9, set 10, set 11, and set 12 in Table A.3.

Next, we provide the effect of reducing the search space of our SneakySnake algorithm and the best exact edit distance algorithm, Edlib [78]. We use the datasets set 1, set 2, set 3, and set 4 in Figure A.4 and the datasets set 5, set 6, set 7, and set 8 in Figure A.5. Finally, in Figure A.6, we use the datasets set 9, set 10, set 11, and set 12.
Weuse Edlib [78] (edit distance mode) to benchmark the accepted and the rejectedpairs for edit distance thresholds of 0 up to 10 edits. Dataset E Time (sec) Accepted Rejected Dataset E Time (sec) Accepted Rejected0 110.61 381,901 29,618,099 0 120.84 124,531 29,875,469 1 135.30 1,345,842 28,654,158 1 121.67 441,927 29,558,073 2 126.87 3,266,455 26,733,545 2 121.33 1,073,808 28,926,192 3 125.46 5,595,596 24,404,404 3 117.88 2,053,181 27,946,819 4 122.09 7,825,272 22,174,728 4 113.79 3,235,057 26,764,943 5 121.73 9,821,308 20,178,692 5 115.00 4,481,341 25,518,659 6 126.26 11,650,490 18,349,510 6 115.26 5,756,432 24,243,568 7 124.86 13,407,801 16,592,199 7 116.10 7,091,373 22,908,627 8 126.83 15,152,501 14,847,499 8 116.71 8,531,811 21,468,189 9 122.94 16,894,680 13,105,320 9 117.47 10,102,726 19,897,274 10 122.60 18,610,897 11,389,103 10 118.49 11,807,488 18,192,512 Dataset E Time (sec) Accepted Rejected Dataset E Time (sec) Accepted Rejected0 141.01 11,989 29,988,011 0 124.32 11 29,999,989 1 130.52 44,565 29,955,435 1 115.18 18 29,999,982 2 123.52 108,979 29,891,021 2 110.56 24 29,999,976 3 125.81 206,903 29,793,097 3 111.59 27 29,999,973 4 120.18 334,712 29,665,288 4 114.13 29 29,999,971 5 120.08 490,670 29,509,330 5 113.80 34 29,999,966 6 119.60 675,357 29,324,643 6 115.74 83 29,999,917 7 124.48 891,447 29,108,553 7 116.70 177 29,999,823 8 125.64 1,151,447 28,848,553 8 118.12 333 29,999,667 9 123.18 1,469,996 28,530,004 9 120.11 711 29,999,289 10 131.24 1,868,827 28,131,173 10 123.42 1,627 29,998,373 ERR240727_1, mrFAST -e=2, 30 million pairs ERR240727_1, mrFAST -e=3, 30 million pairsERR240727_1, mrFAST -e=5, 30 million pairs ERR240727_1, mrFAST -e=40, 30 million pairs Dataset E Time (sec) Accepted Rejected Dataset E Time (sec) Accepted Rejected0 202.82 1,440,497 28,559,503 0 185.55 248,920 29,751,080 1 170.93 1,868,909 28,131,091 1 187.66 324,056 29,675,944 3 179.84 2,734,841 27,265,159 3 190.54 481,724 29,518,276 4 185.39 3,457,975 26,542,025 4 195.78 612,747 
29,387,253 6 194.29 5,320,713 24,679,287 6 208.64 991,606 29,008,394 7 200.69 6,261,628 23,738,372 7 216.06 1,226,695 28,773,305 9 207.89 7,916,882 22,083,118 9 225.75 1,740,067 28,259,933 10 208.18 8,658,021 21,341,979 10 227.97 2,009,835 27,990,165 12 208.92 10,131,849 19,868,151 12 237.14 2,591,299 27,408,701 13 208.01 10,917,472 19,082,528 13 238.31 2,923,699 27,076,301 15 208.96 12,646,165 17,353,835 15 233.91 3,730,089 26,269,911 Dataset E Time (sec) Accepted Rejected Dataset E Time (sec) Accepted Rejected0 293.88 444 29,999,556 0 244.79 201 29,999,799 1 273.39 695 29,999,305 1 248.65 327 29,999,673 3 269.15 927 29,999,073 3 253.53 444 29,999,556 4 264.98 994 29,999,006 4 258.86 475 29,999,525 6 281.14 1,097 29,998,903 6 280.37 529 29,999,471 7 282.71 1,136 29,998,864 7 286.30 546 29,999,454 9 287.50 1,221 29,998,779 9 292.03 587 29,999,413 10 285.02 1,274 29,998,726 10 293.44 612 29,999,388 12 290.85 1,701 29,998,299 12 301.29 710 29,999,290 13 291.93 2,146 29,997,854 13 302.68 796 29,999,204 15 287.24 3,921 29,996,079 15 299.55 1,153 29,998,847 SRR826460_1, mrFAST -e=4, 30 million pairs SRR826460_1, mrFAST -e=6, 30 million pairsSRR826460_1, mrFAST -e=10, 30 million pairs SRR826460_1, mrFAST -e=70, 30 million pairs Table A.3: Details of our last four datasets (set 9, set 10, set 11, and set 12). Weuse Edlib [78] (edit distance mode) to benchmark the accepted and the rejectedpairs for edit distance thresholds of 0 up to 25 edits. 
SRR826471_1, mrFAST -e=8, 30 million pairs
  E   Time (sec)     Accepted      Rejected
  0      337.06        707,517    29,292,483
  2      319.08      1,462,242    28,537,758
  5      340.47      1,973,835    28,026,165
  7      349.22      2,361,418    27,638,582
 10      345.73      3,183,271    26,816,729
 12      339.16      3,862,776    26,137,224
 15      340.40      4,915,346    25,084,654
 17      340.23      5,550,869    24,449,131
 20      341.09      6,404,832    23,595,168
 22      341.57      6,959,616    23,040,384
 25      342.30      7,857,750    22,142,250

SRR826471_1, mrFAST -e=12, 30 million pairs
  E   Time (sec)     Accepted      Rejected
  0      375.35         43,565    29,956,435
  2      390.71         88,141    29,911,859
  5      449.10        119,100    29,880,900
  7      457.05        145,290    29,854,710
 10      462.65        205,536    29,794,464
 12      464.26        257,360    29,742,640
 15      528.12        346,809    29,653,191
 17      572.97        409,978    29,590,022
 20      521.83        507,177    29,492,823
 22      468.11        572,769    29,427,231
 25      467.72        673,254    29,326,746

SRR826471_1, mrFAST -e=15, 30 million pairs
  E   Time (sec)     Accepted      Rejected
  0      502.22          4,389    29,995,611
  2      469.97          8,970    29,991,030
  5      505.53         12,420    29,987,580
  7      490.44         15,405    29,984,595
 10      488.32         22,014    29,977,986
 12      482.28         27,817    29,972,183
 15      483.46         37,710    29,962,290
 17      483.10         44,225    29,955,775
 20      484.07         54,650    29,945,350
 22      483.88         62,255    29,937,745
 25      482.39         74,761    29,925,239

SRR826471_1, mrFAST -e=100, 30 million pairs
  E   Time (sec)     Accepted      Rejected
  0      449.02             49    29,999,951
  2      467.78            163    29,999,837
  5      524.47            301    29,999,699
  7      533.67            375    29,999,625
 10      612.16            472    29,999,528
 12      655.45            520    29,999,480
 15      657.91            575    29,999,425
 17      547.26            623    29,999,377
 20      542.69            718    29,999,282
 22      541.08            842    29,999,158
 25      544.22          1,133    29,998,867

(Column headers only: E; Edlib-100; Edlib-50; Edlib-25; Edlib-10; Edlib-5; S.Snake-100; S.Snake-5.)

Table A.5: Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using set 5, set 6, set 7, and set 8 datasets.

(Column headers only: E; Edlib-150; Edlib-50; Edlib-25; Edlib-10; Edlib-5; S.Snake-150; S.Snake-5.)
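The accepted/rejected counts reported in the tables above come from thresholding each sequence pair's edit distance at E: a pair is accepted if its edit distance is at most E, and rejected otherwise. The following is a minimal sketch of that classification, using a plain dynamic-programming (Levenshtein) edit distance in place of Edlib's optimized implementation; the function names and the toy pairs are illustrative only.

```python
def edit_distance(a, b):
    """Classic O(len(a) * len(b)) Levenshtein distance via dynamic programming.

    This stands in for Edlib's edit distance mode; it is not Edlib itself.
    """
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # deleting all i characters of a's prefix
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match / substitution
        prev = curr
    return prev[-1]

def benchmark(pairs, threshold):
    """Split pairs into (accepted, rejected) counts at edit distance threshold E."""
    accepted = sum(1 for read, ref in pairs
                   if edit_distance(read, ref) <= threshold)
    return accepted, len(pairs) - accepted

# Toy example with three pairs classified at E = 2
pairs = [("ACGT", "ACGA"), ("ACGT", "ACGT"), ("ACGT", "TTTT")]
accepted, rejected = benchmark(pairs, threshold=2)
# accepted == 2 (distances 1 and 0), rejected == 1 (distance 3)
```

In the actual experiments, each pair is a read and the reference segment at one of its candidate mapping locations, and the tables report how many of the 30 million pairs fall on each side of the threshold.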