A high-performance MEMRISTOR-based Smith-Waterman DNA sequence alignment Using FPNI structure


Mahdi Taheri (a), Hamed Zandevakili (b), Ali Mahani (c)

Department of Electrical Engineering, Shahid Bahonar University, Kerman, Iran

ARTICLE HISTORY

Compiled September 22, 2020

ABSTRACT

Purpose-

This paper presents a new sequence alignment method that handles reads of different lengths at the input without re-configuration; sensitivity to read length is a crucial drawback of existing DNA sequencing methods.

Study design/methodology/approach-

We propose a new Race Logic implementation of the seed extension kernel of the BWA-MEM alignment algorithm. It is the first proposed method that does not need re-configuration to execute the seed extension kernel for input reads of different lengths. We use memristors instead of conventional complementary metal-oxide-semiconductor (CMOS) devices, which lowers area overhead and power consumption. We also use the Field-Programmable Nanowire Interconnect (FPNI) architecture for the matrix output, which yields a flexible output that bypasses the reconfiguration procedure of the system for reads of different lengths.

Findings-

Considering power, area, and delay efficiency, our design outperforms other state-of-the-art implementations. We gain up to 22x speedup over state-of-the-art systolic arrays, up to 600x speedup over previous state-of-the-art methods when different seed lengths are considered, at least 10x improvement in area overhead, and 10x improvement in power.

Originality/value-

A new memristor-based Smith-Waterman matrix implementation is proposed in this work. We show that our design has the flexibility to produce the matrix output for different input dimensions without suffering extra latency.

KEYWORDS

Bioinformatics, BWA-MEM, memristor, FPNI, Race logic.

1. Introduction

Recent research on genomic sequence alignment has produced a variety of algorithms and special-purpose designs that improve the performance and energy consumption of sequence aligners. Genomics can be counted among the Big Data sciences, and as technology advances its data volume keeps growing. The volume of data produced by genomics can be compared with three main Big Data generators (Giles, 2012):

(1) Astronomy: For decades, astronomy has faced Big Data challenges.
(2) YouTube: A huge number of videos and other content are created and shared on YouTube.
(3) Twitter: Mining the more than 400 million messages sent every day creates many opportunities for new insights.

Fig. 1 (Stephens et al., 2015) compares these four groups of data generators and shows how genomics increasingly dominates in its demands for data acquisition, storage, distribution, and analysis.

(a) M. Taheri. Email: [email protected]; (b) H. Zandevakili. Email: [email protected]; (c) A. Mahani. Email: [email protected]

Figure 1. Comparison of the four groups of Big Data generators in 2025 (Stephens et al., 2015).

The first step in most genomics applications is sequence alignment. A large number of reads from a DNA strand have to be aligned against a reference genome, and the best alignment for each read is produced as output. There is a variety of sequence alignment tools, such as:

(1) Bowtie (Langmead, Trapnell, Pop, & Salzberg, 2009)
(2) BWA (H. Li & Durbin, 2009)
(3) MAQ (H. Li, Ruan, & Durbin, 2008)
(4) SOAP (R. Li et al., 2009)
(5) BWA-MEM (H. Li, 2013)

As an example of an alignment algorithm, suppose we want to find all local alignments with a dynamic programming approach. The Smith-Waterman algorithm (Smith, Waterman, et al., 1981) takes O(nm) time to align a read of length n against a reference of length m, which is too slow for genome-scale references.

For example, NGS, the fastest sequencing technology, takes hours and a lot of memory to sequence an entire human genome. According to the experimental results of (Lam, Sung, Tam, Wong, & Yiu, 2008), aligning a read of 1000 characters against the human genome takes more than 15 hours. In real applications we work with genes or chromosomes that are a few thousand to a few hundred million base pairs long. Aligning the whole human genome with the SW method would take days to weeks.

Other algorithms, such as BLAST (Kent, 2002), are heuristic methods that find local alignments very efficiently. With BLAST, aligning a read of 1000 bp against the human genome takes 10-20 seconds (Kent, 2002).

Given such time-consuming calculations, general-purpose processors are not a good solution for these bioinformatics workloads. We therefore need more parallel, specialized hardware such as GPUs or FPGAs, dedicated to massively accelerating the intensive computations and delivering large speedups.

In this work, we accelerate the Smith-Waterman-like algorithm with a Race Logic strategy based on memristor elements to speed up execution. The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 discusses our design contributions and the details of our memristor-based design. Section 4 evaluates the results, and Section 5 concludes the article.

2. Related work

We are experiencing exponential growth of experimental data and information in biology, the so-called data explosion (Marx, 2013). One of the most useful operations in bioinformatics is DNA sequencing. Four nucleotides (A, C, G, T) form the foundation of DNA sequences; swapping these nucleotides causes alternate biochemical functions and products within the DNA. One of the most computationally demanding parts of bioinformatics is finding similarities between two DNA sequences, which is called pairwise alignment. Different methods accomplish this, with different time costs. Smith-Waterman (SW) is one of the most accurate algorithms, with high sensitivity but also high computation time and hardware resource usage; its complexity is quadratic. BLAST (Altschul et al., 1997) and FASTA (Pearson & Lipman, 1988) are derivatives of SW that are significantly faster but, because of their lower sensitivity, do not guarantee optimal solutions. Another dynamic programming method for comparing two macromolecules is the Needleman-Wunsch (NW) algorithm (Wilbur & Lipman, 1984), which calculates the alignment score between two sequences based on the Levenshtein distance.

There have been various other efforts to reduce the computation time of different parts of pairwise alignment algorithms. A custom ASIC implementation, BioSCAN, is introduced in (Singh, Hoffman, Tell, & White, 1996), in which a heuristic and very high-density implementation yields high performance. A new method of information representation was proposed in (Madhavan, Sherwood, & Strukov, 2014); it performs computation by setting up logical race conditions in a circuit on an ASIC platform and achieves about 3x higher throughput at 5x lower power density.
The authors of (Rucci et al., 2018) evaluate SWIFOLD, a Smith-Waterman parallel implementation for long DNA sequences that is implemented on an Intel core with OpenCL, and report better performance at the cost of higher resource consumption. In (Zokaee, Zarandi, & Jiang, 2018), a ReRAM-based process-in-memory architecture is designed to improve short-read alignment throughput per watt by 13×. The authors of (Alser et al., 2017) propose a new hardware accelerator that filters out most incorrect candidate locations with a 130-fold speedup over software. A faster implementation of SW in (Farrar, 2006) achieves a 2 to × performance improvement over other SIMD-based SW implementations. The intrinsic delay of circuit elements is also used for edit-distance computation in (Banerjee et al., 2018), which proposes the ASAP accelerator based on the Race Logic hardware acceleration that was presented in (Madhavan et al., 2014) for accelerating the SW and NW algorithms on an ASIC platform. Their work achieves a 200× speedup over an equivalent Smith-Waterman C implementation.

Other works accelerate the BWA-MEM genomic mapping algorithm on different platforms such as GPUs and FPGAs. BWA-MEM is a widely used algorithm for mapping genomic sequences onto a reference genome. It is composed of three main computational kernels (H. Li, 2013):

(1) SMEM Generation: This kernel finds seeds (sub-strings of the reads) that are likely mapping locations of the read against the reference genome. Several seeds of variable length may be generated for each read (H. Li & Durbin, 2009). This step is an exact-match-finding phase that uses the Burrows-Wheeler transform. For this work, seeds are at least 19 and at most 131 characters long.
(2) Seed Extension: This is an inexact-matching step that chains seeds and extends them in two directions using a Smith-Waterman-like algorithm (Smith et al., 1981). This part of the BWA-MEM algorithm finds the optimal local alignment using a scoring system.
(3) Output Generation: In this step the best alignment (i.e., the one with the highest score) is finalized and provided as output in SAM format, if necessary.

Note that the seed extension kernel used in BWA-MEM differs from the Smith-Waterman algorithm in two substantial ways (Houtgast, Sima, Bertels, & Al-Ars, 2015):

(1) Non-zero initial values: The initial values in the first column and row depend on the alignment score of the seed found by the SMEM Generation kernel.
(2) Additional output generation: Besides the local and global alignment scores, the exact location inside the similarity matrix and a maximum offset (indicating the distance from the diagonal at which a maximum score has been found) are also generated.

Several techniques have been proposed to accelerate the BWA-MEM inexact alignment algorithm. However, the seed extension step of this algorithm makes it inherently slow (Table 1).

Table 1. Profiling the BWA-MEM algorithm (Houtgast et al., 2015).

Kernel              % Execution time   Bound
SMEM generation     56%                Memory
Seed extension      32%                Computational
Output generation   9%                 Memory
Other               3%                 I/O
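As a rough illustration of why the profile in Table 1 limits the application-level gain, the sketch below applies Amdahl's law to the seed extension share. It assumes (our assumption, not a statement from the cited profiling) that only the seed extension kernel is accelerated and that the remaining 68% of the run time is unchanged; the function name `overall_speedup` is hypothetical.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when one part runs faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / kernel_speedup)


# Share of execution time spent in seed extension, taken from Table 1.
SEED_EXTENSION_SHARE = 0.32

if __name__ == "__main__":
    # Hypothetical kernel speedups, chosen only to show the trend.
    for s in (2.0, 10.0, 100.0):
        total = overall_speedup(SEED_EXTENSION_SHARE, s)
        print(f"kernel x{s:g} -> application x{total:.2f}")
```

Under this assumption the application-level speedup can never exceed 1/0.68, however fast the kernel becomes, which is why kernel-level and application-level figures in the literature should not be compared directly.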

The first accelerated implementation of BWA-MEM is presented in (Houtgast et al., 2015), which evaluates a number of FPGA-based systolic array architectures. Their implementation is 3× faster than software-only execution. A hardware acceleration of BWA-MEM genomic short-read mapping for longer read lengths is implemented in (Houtgast, Sima, Bertels, & Al-Ars, 2018). This design is based on the previously proposed architecture (Houtgast et al., 2015), in which an FPGA-based 1D systolic array accelerates the BWA-MEM genomic mapping algorithm. The main idea is to insert exit points between the PEs of the 1D systolic array to avoid unnecessary calculations for shorter reads. Shorter reads then do not have to pass through all of the PEs and can leave the array once they reach the first exit point. The authors of (Houtgast, Sima, Bertels, & Al-Ars, 2016) discuss acceleration of the Seed Extension kernel of the BWA-MEM algorithm on a GPU accelerator and achieve up to 1 . × improvement in application-level execution time. A power-efficiency analysis of accelerated BWA-MEM implementations on heterogeneous computing platforms against the software-only baseline system, obtained by offloading the Seed Extension phase to an accelerator, is studied in (Houtgast, Sima, Marchiori, Bertels, & Al-Ars, 2016). A high-performance FPGA-based Seed Extension IP core for BWA-MEM DNA alignment is designed in (Pham-Quoc, Kieu-Do, & Thinh, n.d.) and achieves a 350× speedup compared with an Intel Core i5 general-purpose processor. The authors of (Bayat, Gaëta, Ignjatovic, & Parameswaran, 2019) gain up to 14 . × speedup over the Smith-Waterman algorithm by (a) applying heuristics, (b) processing MEMs, and (c) extracting MEMs with a bit-level parallel method.

Despite all these works, the memory access, area overhead, time, and power consumption of alignment algorithms and their implementations remain highly problematic. We target these problems in our work, and our proposed methods improve on all of them.

3. Proposed design

This section describes the proposed method for filling the similarity matrix of the Smith-Waterman-based algorithm and shows how it reduces execution time and power consumption compared with state-of-the-art architectures. In addition, our method uses a non-fixed-length strategy that can lead to higher speedup because it does not need to be reconfigured for different read lengths.

Race Logic is a new data representation used for a broad class of optimization problems, namely those that are solved with dynamic programming algorithms. Race Logic has different implementations, such as synchronous and asynchronous; we focus on the synchronous type for our design. The idea of Race Logic is to exploit race conditions in a circuit to optimize computation time. We designed an SW similarity matrix based on the Race Logic idea. We also use memristors instead of conventional complementary metal-oxide-semiconductor (CMOS) devices, which leads to better performance. In addition, we use the Field-Programmable Nanowire Interconnect (FPNI) architecture (Zandevakili & Mahani, 2018) for our matrix output. Notably, the memristor structure achieves lower power consumption and area overhead than the previous CMOS, ASIC, and FPGA structures, as reported in the results. We also obtain lower delay as a result of:

(1) Using a memristor structure with the Race Logic strategy, which lowers circuit delay.
(2) Using FPNI as a flexible output, which bypasses the reconfiguration procedure of the system for reads of different lengths.

First we describe the main idea of our design and show how it yields the correct result for the Smith-Waterman-like matrix with improved performance. The Smith-Waterman algorithm is a dynamic programming algorithm that computes the alignment score (Levenshtein distance) of a read string Q and a partial-reference genome string R. To calculate the alignment score of these two strings, the algorithm constructs a matrix S, a lattice of size I_Q × I_R (I_Q and I_R are the lengths of the two strings), and uses a recursive equation to calculate the minimum edit distance between the two strings. Note that in the BWA-MEM algorithm, which we consider for implementing our proposed design, the two strings have the same length, so the matrix is square in each case. Its dimension, however, may differ with the read length. We solve this problem by using FPNI as a flexible circuit output, which gives us the outputs for different matrix dimensions without changing the circuit or re-configuring it.

$$DP_{(i,j)} = \min \begin{cases} DP_{(i-1,j-1)} + T_{(Match,\,Miss\text{-}match)} \\ DP_{(i-1,j)} + T_{(Gap)} \\ DP_{(i,j-1)} + T_{(Gap)} \end{cases} \qquad (1)$$

where DP denotes the similarity matrix, T_(Match,Miss-match) is the score assigned when a match or a mismatch occurs (usually 0 for a match and 2 for a mismatch (Banerjee et al., 2018)), and T_(Gap) is the gap penalty, usually 1 (Banerjee et al., 2018). A match is the situation in which the two corresponding nucleotides are the same, and a mismatch is the situation in which they differ. These parameters can be chosen to optimize the accuracy of the alignment based on the structure of the sequences being compared (Wang, Yan, Wang, Si, & Zhang, 2011) (Henikoff & Henikoff, 1992) (Sung, 2009). We use fixed gap penalties with the most commonly used value (Sung, 2009).
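The recurrence in Eq. 1 can be sketched in software as a plain dynamic programming fill. This is a minimal reference sketch, not the authors' hardware: the function name is hypothetical, the default penalties are the "usual" values quoted above (0, 2, 1), and the border initialization (multiples of the gap penalty) is our assumption, since the text does not specify it.

```python
from typing import List


def fill_similarity_matrix(read: str, ref: str, match: int = 0,
                           mismatch: int = 2, gap: int = 1) -> List[List[int]]:
    """Fill the DP lattice of Eq. 1 and return it (minimum-cost form)."""
    rows, cols = len(read) + 1, len(ref) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Assumed border condition: reaching (i, 0) or (0, j) costs only gaps.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            # T(Match, Miss-match): chosen by comparing the two nucleotides.
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # diagonal move
                           dp[i - 1][j] + gap,       # gap in the reference
                           dp[i][j - 1] + gap)       # gap in the read
    return dp


if __name__ == "__main__":
    for row in fill_similarity_matrix("ACGT", "AGT"):
        print(row)
```

The quadratic cost mentioned earlier is visible in the two nested loops: every one of the (n+1)(m+1) cells is visited once.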
The above equation, which represents the Smith-Waterman similarity matrix for local alignment, finds the largest sub-string of R that maps to string Q with the lowest Levenshtein distance (LD) (see (Navarro, 2001) and (Levenshtein, 1966) for more information). This method is accurate and yields the optimal alignment, but at a high computational complexity. To overcome this problem, we can replace the LD values in Equation 1 with their equivalent propagation delays and use a delay-based approach for addition and minimization, the two operations that the recursive Equation 1 requires. For clarity, we give some examples of how the addition and minimization operations can be modeled with the Race Logic strategy.

- Race logic:

Suppose we have two signals, M and N, that are set to logic value '1' (a high signal is injected) at different times. The time delay represents the value of each signal. For example, if signal M is set to '1' after a time delay D1, the value of M is "D1"; if the second signal is set after a time delay D2, the value of N is "D2".

(1) To add the two values, we connect the circuit elements of M and N in series. The total propagation delay of the output is then the sum of "D1" and "D2".
(2) If we connect the two circuit elements to an OR gate, the signal that arrives first at the OR gate emerges from it. This structure acts as a minimization operator: both signals have the value '1', and the signal with the smaller delay arrives first and drives the gate output to '1' earlier.
(3) To read out the value of the output, we can place a counter at the end of the Race Logic design to serve as a decoder (Banerjee et al., 2018).

We can apply these delay-based computations to the SW similarity matrix for LD calculation.
The delay between the rising edge of the input signal injected into the lattice and its emergence at any element on the last row is then the minimum score of the local alignment.

Figure 2. Computing with propagation delays: the delay-based proxy for the addition operator is a series connection, and the proxy for the min operator is the OR gate (Banerjee et al., 2018).
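To make the delay-based view concrete, the following sketch simulates a synchronous Race Logic lattice tick by tick: a delay of d ticks stands for addition, an OR over the delayed inputs stands for minimization, and the tick counter plays the role of the decoder. It is an illustrative software model under our own assumptions (integer non-negative delays, gap-delay borders, a single injection at the top-left node), not the circuit proposed in this paper; all names are hypothetical.

```python
from typing import List, Optional


def race_logic_arrival(read: str, ref: str, match: int = 0,
                       mismatch: int = 2, gap: int = 1) -> int:
    """Return the tick at which the bottom-right node of the lattice goes high."""
    rows, cols = len(read) + 1, len(ref) + 1
    # fired_at[i][j] is the tick at which node (i, j) went high, or None.
    fired_at: List[List[Optional[int]]] = [[None] * cols for _ in range(rows)]

    def high(i: int, j: int, delay: int, tick: int) -> bool:
        """Is the output of node (i, j), delayed by `delay` ticks, high now?"""
        t = fired_at[i][j]
        return t is not None and t + delay <= tick

    tick = 0  # the counter that decodes the result
    while fired_at[rows - 1][cols - 1] is None:
        # Row-major order lets zero-delay (match) edges settle within a tick.
        for i in range(rows):
            for j in range(cols):
                if fired_at[i][j] is not None:
                    continue
                if i == 0 and j == 0:
                    fired_at[i][j] = tick  # injected rising edge
                    continue
                inputs = []
                if i > 0 and j > 0:
                    sub = match if read[i - 1] == ref[j - 1] else mismatch
                    inputs.append(high(i - 1, j - 1, sub, tick))
                if i > 0:
                    inputs.append(high(i - 1, j, gap, tick))
                if j > 0:
                    inputs.append(high(i, j - 1, gap, tick))
                if any(inputs):  # OR gate: the earliest arrival wins
                    fired_at[i][j] = tick
        tick += 1
    return fired_at[rows - 1][cols - 1]


if __name__ == "__main__":
    print(race_logic_arrival("ACGT", "AGT"))
```

Because each node fires at the earliest tick at which any delayed predecessor is high, the arrival time at a node equals the minimum over its three predecessors of (arrival time + delay), which is the same quantity Eq. 1 computes.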

Figure 3. Accelerated architecture.

Fig. 3 shows our accelerated architecture. It includes basic cells that implement the desired functionality and a routing network that gives easy access to the outputs of some predefined basic cells. The different parts of the proposed architecture are detailed below.

- MEMRISTOR-element:

Memristors (L. Chua, 1971) are two-terminal devices regarded as the fourth classical circuit element, alongside the resistor, inductor, and capacitor. Memristors are changeable resistors that can be used as memory, in which case the resistance is the stored data. Memristive devices (L. O. Chua & Kang, 1976) can also be used in other applications such as logic and analog circuits. The reasons for using memristors instead of CMOS circuits in our Race Logic design are:

(1) Data can be read and written faster with these devices than with CMOS circuits (Torrezan, Strachan, Medeiros-Ribeiro, & Williams, 2011).
(2) They are typically small devices, so CMOS circuits are usually larger than memristor-based circuits.
(3) Nonvolatility is the main feature of memristors, and they are compatible with standard CMOS technology (Borghetti et al., 2009). They are well suited to FPGA-like applications.

From the above, we conclude that memristive devices are nonvolatile, dense, fast, and power efficient, which addresses many major problems of semiconductor devices.

Our design is programmable: the user can set the delays corresponding to the "match", "mismatch", and "gap" penalties. For example, when we know that most nucleotide comparisons are matches, we can encode "match" with a time delay of '0', which ensures that large portions of the SW matrix take zero time to explore. Choosing different penalty values helps optimize the search time.

- Basic cells:

The schematic of our proposed cell is shown in Fig. 4.
Accordingly, it includes three delay elements (DM, DI, DD), which are responsible for the respective mathematical operations of Eq. 1; a comparator/selector unit, which compares the two nucleotides that are the inputs of each matrix cell and decides whether a match or a mismatch occurs; one local OR gate to implement the min operation in Eq. 1; and one global OR gate that gives the flexibility of choosing the output from different stages of the SW matrix.

- The comparator/selector unit:

This unit includes several CMOS XNOR gates and a memristor-based NAND gate, which compare the Ref and Read data, and a multiplexer controlled by the comparator output, which selects the corresponding match or mismatch penalty as its output. When the comparator output is 0, the Ref data equals the Read data, and the delay value for a match (which the user can define in our design) is passed to the output of the selector unit. The structure of our proposed comparator/selector unit is shown in Fig. 5.

- The delay element (DE):

Delay elements are composed of:

(1) Three input wavefronts, which represent the input signals and are the results of the preceding DEs in the grid.
(2) Two corresponding nucleotides as input signals, which the element has to compare.
(3) Three input signals representing the match, mismatch, and gap penalty values.
(4) One output signal (global OR gate), which represents the output of Eq. 1 (DP_(i,j)).
(5) One output signal (local OR gate), which provides our desired flexible matrix output and is used for local alignment.

Figure 4. Basic cell of our proposed design.

The propagated output wavefront of each DE is a delay signal that accounts for the corresponding match, mismatch, and gap delay penalties.
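The comparator/selector and the timing behaviour of one basic cell can be sketched at the logic level as follows. This is a behavioural model only: the 2-bit nucleotide encoding, the function names, and the default delays are our assumptions for illustration, and the memristor and CMOS implementation details of Fig. 4 and Fig. 5 are not modeled.

```python
# Assumed 2-bit encoding of the nucleotides (not specified in the text).
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def xnor(a: int, b: int) -> int:
    """1 when the two bits are equal, 0 otherwise."""
    return 1 - (a ^ b)


def nand(*bits: int) -> int:
    """0 only when every input bit is 1."""
    return 0 if all(bits) else 1


def comparator(ref_base: str, read_base: str) -> int:
    """Bitwise XNOR followed by NAND: 0 means Ref equals Read."""
    r, q = ENCODING[ref_base], ENCODING[read_base]
    bit_equal = [xnor((r >> k) & 1, (q >> k) & 1) for k in range(2)]
    return nand(*bit_equal)


def selector(cmp_out: int, match_delay: int, mismatch_delay: int) -> int:
    """Multiplexer: pass the match delay when the comparator output is 0."""
    return match_delay if cmp_out == 0 else mismatch_delay


def cell_output_time(ref_base: str, read_base: str, t_diag: int, t_up: int,
                     t_left: int, match_delay: int = 0,
                     mismatch_delay: int = 2, gap_delay: int = 1) -> int:
    """Arrival time at the cell output: earliest of the three delayed inputs."""
    sub = selector(comparator(ref_base, read_base), match_delay, mismatch_delay)
    return min(t_diag + sub, t_up + gap_delay, t_left + gap_delay)


if __name__ == "__main__":
    print(comparator("A", "A"), comparator("A", "T"))
    print(cell_output_time("G", "G", t_diag=3, t_up=5, t_left=5))
```

Passing the three delays as arguments mirrors the runtime-programmable penalties described in the text: changing them requires no change to the cell logic.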
When the outputs of other DEs, or a signal wavefront, reach an element, a delay is created according to the specified match/mismatch and gap penalties by propagating the signals through the memristors. Another advantage of our design is that it allows the user to program (i.e., dynamically set at runtime) the values of the match, mismatch, and gap penalties for different applications, which gives the flexibility to use our approach in cases that merely require re-parameterization of the gap penalties. The structure of our proposed delay element is shown in Fig. 6. It includes some delay elements to build different delays and a multiplexer to select the desired delay. As shown in Fig. 6, we use memristors to implement the delay elements to reduce the area overhead.

- Local OR gate:

The local OR gate makes it possible to avoid the unnecessary latency caused by variable input length. An OR gate is a proxy for the minimization operator: it passes the signal that arrives first at the gate. As shown in Fig. 7, we use a memristor-based OR gate for this purpose to reduce the area overhead.

- Global OR gate:

The global OR gate implements the minimization operation in Eq. 1. Its structure is shown in Fig. 7. We use a memristor-based OR gate here as well to reduce the area overhead.

- The routing network:

The Needleman-Wunsch and Smith-Waterman algorithms are well-known dynamic programming algorithms that yield the optimal global and local alignment, respectively, of a read against the reference genome. In these approaches, a similarity matrix is filled to find the local and global alignment scores of reads against the corresponding reference sub-string (H. Li, 2013). Consider the practical scenario in which reads have at most 150 base pairs (bp). We then construct our similarity matrix with dimension 131 × 131, based on the BWA-MEM approach. We would like the processing time of the similarity matrix kernel not to be penalized by the read length, but because the similarity matrix dimension is fixed, shorter reads incur unnecessary latency. To avoid this unnecessary latency, we need a method that adapts to different read lengths and makes the output available from the desired dimension of the similarity matrix. We can thereby eliminate the unnecessary latency that comes from traversing all elements of the lattice irrespective of the read length.

The original Race Logic design was demonstrated in simulation as an ASIC (Madhavan et al., 2014). Although this method has advantages in power consumption and a substantial improvement in throughput compared with state-of-the-art systolic implementations, it suffers from the following problems:

(1) The original Race Logic design uses conventional complementary metal-oxide-semiconductor (CMOS) technology, which is worse than our approach in size, power consumption, and read and write time.
(2) Signals travel through all elements of the fixed-dimension similarity matrix irrespective of the read length, which incurs unnecessary latency for shorter reads.

Our proposed accelerator is runtime-programmable for changing the input data size, which defines the size of the accelerator lattice. For this purpose, we use a nanowire-based routing network inspired by the FPNI technique (Zandevakili & Mahani, 2018). Field-programmable nanowire interconnect (FPNI) is a new hybrid structure with the following advantages:

(1) High flexibility
(2) Low fabrication cost

With this technique, we can change the size of the accelerator lattice at runtime according to the input data size. As shown in Fig. 4, our proposed routing network includes nanowires that access the outputs of some predefined basic cells and a selection unit, controlled by the input data size, that selects the desired output. Each nanowire is connected through a via to the output of the local OR gate in the desired basic cell.

Figure 5. Comparator/selector unit.

Figure 6. Delay element unit, which includes some delay elements to build different delays and a multiplexer to select the desired delay.

Figure 7. Structure of the memristor-based OR gate in our design.
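The benefit of tapping the output at the cell that matches the read length, instead of waiting for the corner of the fixed lattice, can be illustrated with a toy timing model. Everything in this sketch is an assumption made for illustration: arrival times are computed with the software recurrence rather than a circuit simulation, unused lattice positions are filled with padding symbols that never match, and the names are hypothetical. It does not model the nanowire routing network itself.

```python
from typing import List, Tuple


def arrival_times(read: str, ref: str, match: int = 0,
                  mismatch: int = 2, gap: int = 1) -> List[List[int]]:
    """Arrival time (in ticks) at every lattice node, via the Eq. 1 recurrence."""
    rows, cols = len(read) + 1, len(ref) + 1
    t = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        t[i][0] = i * gap
    for j in range(1, cols):
        t[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            t[i][j] = min(t[i - 1][j - 1] + sub,
                          t[i - 1][j] + gap,
                          t[i][j - 1] + gap)
    return t


def tapped_latency(read: str, ref: str, lattice_size: int) -> Tuple[int, int]:
    """Return (latency at the tap for this read length, latency at the corner)."""
    n = len(read)
    if len(ref) != n or n > lattice_size:
        raise ValueError("read and ref must share a length <= lattice_size")
    # Assumed padding: distinct lowercase symbols so padded cells never match.
    padded_read = read + "x" * (lattice_size - n)
    padded_ref = ref + "y" * (lattice_size - n)
    times = arrival_times(padded_read, padded_ref)
    return times[n][n], times[lattice_size][lattice_size]


if __name__ == "__main__":
    tap, corner = tapped_latency("ACGT", "ACGT", lattice_size=8)
    print(f"tap at (4,4): {tap} ticks, fixed corner at (8,8): {corner} ticks")
```

In this model the value at the tap is already final when it fires, because a node's arrival time depends only on the prefixes above and to the left of it; the extra ticks spent reaching the corner carry no information for the shorter read.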

4. Results

In this section, the simulation results of the proposed method are compared with some well-known approaches. The methods are evaluated on several criteria: area, delay, and power consumption. In Fig. 8, the delay results of the proposed structure are compared with state-of-the-art systolic arrays and the Race Logic design. These are two of the best implementations of dynamic programming methods in terms of accuracy and speedup, so we compare our design against them to show the performance of our work. Each evaluation criterion is detailed below.

Table 2. Area (nm).

Read length   Proposed    Systolic    Race logic
1             8.51E+02    7.34E+04    9.18E+03
2             3.40E+03    1.18E+05    2.09E+04
4             1.36E+04    2.34E+05    7.31E+04

(1) Area: To compute the area occupied by each method, we used the transistor counting technique in 65 nm technology. Table 2 compares the occupied area of the proposed method with the two other methods; the results show that we achieve up to 10-fold area improvement.
(2) Delay: We need an electrical model of the nanowires, junctions, and CMOS components to calculate the delay of the proposed structure. For this purpose, we used the electrical model proposed in (Snider & Williams, 2007) for the FPNI structure. The electrical model for a simple circuit is shown in Fig. 9. Some of the model parameters, such as the closed-junction resistance, the capacitance and resistance per unit length, and the geometry of the wires, are listed in Table 3 (Snider & Williams, 2007). We used the HSpice tool to calculate the delay of the proposed structure. The results are presented in Fig. 8. Fig. 10 shows how our design performs against a fixed-dimension matrix implementation.

Table 3. Experimental parameters for FPNI architecture (Snider & Williams, 2007).

Parameter   Description                  FPNI 30 nm
Pnano       Nanowire pitch               30 nm
Wnano       Nanowire width               15 nm
Wpin        Pin diameter                 90 nm
Wpinvar     Pin size variation           20 nm
Walign      Alignment error              40 nm
Wsep        Pin/wire separation          15 nm
Rclosed     Closed junction resistance   24 kΩ
p           On/off resistance ratio      >

(3) Power: The dynamic power is computed as

$$P_{dynamic} = A \, N \, C \, V_{dd}^{2} \, f \qquad (2)$$

where A is the average activity of a signal, N is the number of allocated nanowires, C is the capacitance of a single nanowire, V_dd is the supply voltage used by the CMOS, and f is the maximum clock frequency determined by timing analysis. We used the HSpice tool to calculate the power consumption of the proposed structure. In Fig. 11 we compare our design with systolic arrays and the Race Logic approach; the results show that those designs are power hungry compared with our memristor-based design.

Figure 8. Latency of the proposed method compared with the state-of-the-art systolic array and Race Logic designs.

Figure 9. (a) A signal with a fan-out of 2, (b) its implementation with nanowires, (c) the electrical model (Snider & Williams, 2007).
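The area comparison and the dynamic power expression can be checked with a few lines of arithmetic. The area values below are copied from Table 2; the power function assumes the usual quadratic dependence on the supply voltage in Eq. 2, and the operating point passed to it is purely hypothetical (it is not a measurement from this work). All names are our own.

```python
from typing import Dict, Tuple

# Read length -> (proposed, systolic, race logic) area, copied from Table 2.
AREA: Dict[int, Tuple[float, float, float]] = {
    1: (8.51e02, 7.34e04, 9.18e03),
    2: (3.40e03, 1.18e05, 2.09e04),
    4: (1.36e04, 2.34e05, 7.31e04),
}


def area_ratios(table: Dict[int, Tuple[float, float, float]]
                ) -> Dict[int, Tuple[float, float]]:
    """For each read length: (systolic / proposed, race logic / proposed)."""
    return {n: (systolic / proposed, race / proposed)
            for n, (proposed, systolic, race) in table.items()}


def dynamic_power(activity: float, nanowires: int, capacitance: float,
                  vdd: float, frequency: float) -> float:
    """Dynamic power as A * N * C * Vdd^2 * f (squared Vdd is assumed)."""
    return activity * nanowires * capacitance * vdd ** 2 * frequency


if __name__ == "__main__":
    for n, (vs_systolic, vs_race) in area_ratios(AREA).items():
        print(f"read length {n}: {vs_systolic:.1f}x vs systolic, "
              f"{vs_race:.1f}x vs race logic")
    # Hypothetical operating point, for illustration only.
    print(dynamic_power(activity=0.5, nanowires=10, capacitance=1e-15,
                        vdd=1.0, frequency=1e9))
```

Printing the ratios per read length makes it easy to see how the area advantage changes as the lattice grows.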

5. Conclusion

We present a new memristor-based Smith-Waterman matrix implementation that achieves more than 6 times speedup over the state-of-the-art Race Logic approach and 22 times speedup over the systolic array implementation. We show that our design has the flexibility to produce the matrix output for different input dimensions without suffering unnecessary latency. Tested with different read lengths against the fixed 131 × 131 Smith-Waterman matrix dimension, our implementation achieves up to 600x speedup. We also achieve at least 10x improvement in area overhead and 10x improvement in power. Furthermore, our approach offers programmable match, mismatch, and gap penalties, which makes it more practical because the penalties can be changed to suit the biological application.

Figure 10. Delay ratio of the proposed method, systolic array, and Race Logic, considering the fixed 131 × 131 matrix dimension.

Figure 11. Power consumption of our proposed design compared with the systolic array and Race Logic designs.

References

Alser, M., Hassan, H., Xin, H., Ergin, O., Mutlu, O., & Alkan, C. (2017). Gatekeeper: a new hardware architecture for accelerating pre-alignment in dna short read mapping. Bioinformatics, (21), 3355–3363.

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, (17), 3389–3402.

Banerjee, S. S., El-Hadedy, M., Lim, J. B., Kalbarczyk, Z. T., Chen, D., Lumetta, S. S., & Iyer, R. K. (2018). Asap: accelerated short-read alignment on programmable hardware. IEEE Transactions on Computers, (3), 331–346.

Bayat, A., Gaëta, B., Ignjatovic, A., & Parameswaran, S. (2019). Pairwise alignment of nucleotide sequences using maximal exact matches. BMC bioinformatics, (1), 261.

Borghetti, J., Li, Z., Straznicky, J., Li, X., Ohlberg, D. A., Wu, W., . . . Williams, R. S. (2009). A hybrid nanomemristor/transistor logic circuit capable of self-programming. Proceedings of the National Academy of Sciences, (6), 1699–1703.

Chua, L. (1971). Memristor-the missing circuit element. IEEE Transactions on circuit theory, (5), 507–519.

Chua, L. O., & Kang, S. M. (1976). Memristive devices and systems. Proceedings of the IEEE, (2), 209–223.

Farrar, M. (2006). Striped smith–waterman speeds database searches six times over other simd implementations. Bioinformatics, (2), 156–161.

Giles, J. (2012). Computational social science: Making the links. Nature News, (7412), 448.

Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, (22), 10915–10919.

Houtgast, E. J., Sima, V.-M., Bertels, K., & Al-Ars, Z. (2015). An fpga-based systolic array to accelerate the bwa-mem genomic mapping algorithm. In (pp. 221–227).

Houtgast, E. J., Sima, V.-M., Bertels, K., & Al-Ars, Z. (2016). Gpu-accelerated bwa-mem genomic mapping algorithm using adaptive load balancing. In International conference on architecture of computing systems (pp. 130–142).

Houtgast, E. J., Sima, V.-M., Bertels, K., & Al-Ars, Z. (2018). Hardware acceleration of bwa-mem genomic short read mapping for longer read lengths. Computational biology and chemistry, , 54–64.

Houtgast, E. J., Sima, V.-M., Marchiori, G., Bertels, K., & Al-Ars, Z. (2016). Power-efficiency analysis of accelerated bwa-mem implementations on heterogeneous computing platforms. In (pp. 1–8).

Kent, W. J. (2002). Blat-the blast-like alignment tool. Genome research, (4), 656–664.

Lam, T. W., Sung, W.-K., Tam, S.-L., Wong, C.-K., & Yiu, S.-M. (2008). Compressed indexing and local alignment of dna. Bioinformatics, (6), 791–797.

Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome biology, (3), R25.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady (Vol. 10, pp. 707–710).

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv:1303.3997.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, (14), 1754–1760.

Li, H., Ruan, J., & Durbin, R. (2008). Mapping short dna sequencing reads and calling variants using mapping quality scores. Genome research, (11), 1851–1858.

Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., & Wang, J. (2009). Soap2: an improved ultrafast tool for short read alignment. Bioinformatics, (15), 1966–1967.

Madhavan, A., Sherwood, T., & Strukov, D. (2014). Race logic: A hardware acceleration for dynamic programming algorithms. In (pp. 517–528).

Marx, V. (2013). Biology: The big challenges of big data. Nature Publishing Group.

Navarro, G. (2001). A guided tour to approximate string matching. ACM computing surveys (CSUR), (1), 31–88.

Pearson, W. R., & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, (8), 2444–2448.

Pham-Quoc, C., Kieu-Do, B., & Thinh, T. N. (n.d.). A high-performance fpga-based bwa-mem dna sequence alignment. Concurrency and Computation: Practice and Experience, e5328.

Rucci, E., Garcia, C., Botella, G., De Giusti, A., Naiouf, M., & Prieto-Matias, M. (2018). Swifold: Smith-waterman implementation on fpga with opencl for long dna sequences.

BMCsystems biology , (5), 96.Singh, R. K., Hoﬀman, D., Tell, S. G., & White, C. T. (1996). Bioscan: a network sharablecomputational resource for searching biosequence databases. Bioinformatics , (3), 191–196.Smith, T. F., Waterman, M. S., et al. (1981). Identiﬁcation of common molecular subsequences. Journal of molecular biology , (1), 195–197.Snider, G. S., & Williams, R. S. (2007). Nano/cmos architectures using a ﬁeld-programmablenanowire interconnect. Nanotechnology , (3), 035204.Stephens, Z. D., Lee, S. Y., Faghri, F., Campbell, R. H., Zhai, C., Efron, M. J., . . . Robinson,G. E. (2015). Big data: astronomical or genomical? PLoS biology , (7), e1002195.Sung, W.-K. (2009). Algorithms in bioinformatics: A practical introduction . CRC Press.Torrezan, A. C., Strachan, J. P., Medeiros-Ribeiro, G., & Williams, R. S. (2011). Sub-nanosecond switching of a tantalum oxide memristor.

Nanotechnology , (48), 485203.Wang, C., Yan, R.-X., Wang, X.-F., Si, J.-N., & Zhang, Z. (2011). Comparison of linear gappenalties and proﬁle-based variable gap penalties in proﬁle–proﬁle alignments. Computa-tional biology and chemistry , (5), 308–318.Wilbur, W. J., & Lipman, D. J. (1984). The context dependent comparison of biologicalsequences. SIAM Journal on Applied Mathematics , (3), 557–567.Zandevakili, H., & Mahani, A. (2018). A new asic structure with self-repair capability usingﬁeld-programmable nanowire interconnect architecture. IEEE Transactions on Very LargeScale Integration (VLSI) Systems , (11), 2268–2278.Zokaee, F., Zarandi, H. R., & Jiang, L. (2018). Aligner: A process-in-memory architecture forshort read alignment in rerams. IEEE Computer Architecture Letters , (2), 237–240.(2), 237–240.