Helix: Algorithm/Architecture Co-design for Accelerating Nanopore Genome Base-calling
Qian Lou, [email protected], Indiana University Bloomington
Sarath Janga, [email protected], Indiana University - Purdue University Indianapolis
Lei Jiang, [email protected], Indiana University Bloomington
ABSTRACT
Nanopore genome sequencing is the key to enabling personalized medicine, global food security, and virus surveillance. State-of-the-art base-callers adopt deep neural networks (DNNs) to translate electrical signals generated by nanopore sequencers into digital DNA symbols. A DNN-based base-caller consumes 44.5% of the total execution time of a nanopore sequencing pipeline. However, it is difficult to quantize a base-caller and build a power-efficient processing-in-memory (PIM) accelerator to run the quantized base-caller. Although conventional network quantization techniques reduce the computing overhead of a base-caller by replacing floating-point multiply-accumulations with cheaper fixed-point operations, they significantly increase the number of systematic errors that cannot be corrected by read votes. The power density of prior nonvolatile memory (NVM)-based PIMs has already exceeded memory thermal tolerance even with active heat sinks, because their power efficiency is severely limited by analog-to-digital converters (ADCs). Finally, Connectionist Temporal Classification (CTC) decoding and read voting cost 53.7% of total execution time in a quantized base-caller, and thus become its new bottleneck.

In this paper, we propose Helix, a novel algorithm/architecture co-designed PIM, to power-efficiently and accurately accelerate nanopore base-calling. From the algorithm perspective, we present systematic error aware training to minimize the number of systematic errors in a quantized base-caller. From the architecture perspective, we propose a low-power SOT-MRAM-based ADC array to process analog-to-digital conversion operations and improve the power efficiency of prior DNN PIMs. Moreover, we revise a traditional NVM-based dot-product engine to accelerate CTC decoding operations, and create a SOT-MRAM binary comparator array to process read voting. Compared to state-of-the-art PIMs, Helix improves base-calling throughput by 6x, throughput per Watt by 11.9x, and throughput per mm2 by 7.5x without degrading base-calling accuracy.

CCS CONCEPTS
• Hardware → Spintronics and magnetic technologies; • Applied computing → Computational genomics.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PACT '20, October 3–7, 2020, Virtual Event, GA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8075-1/20/10...$15.00
https://doi.org/10.1145/3410463.3414626
KEYWORDS
nanopore sequencing; base-calling; processing-in-memory
ACM Reference Format:
Qian Lou, Sarath Janga, and Lei Jiang. 2020. Helix: Algorithm/Architecture Co-design for Accelerating Nanopore Genome Base-calling. In Proceedings of the 2020 International Conference on Parallel Architectures and Compilation Techniques (PACT '20), October 3–7, 2020, Virtual Event, GA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3410463.3414626
1 INTRODUCTION

Genome sequencing [8, 21, 34, 35, 37] is a cornerstone of personalized medicine, global food security, and virus surveillance. The emerging nanopore genome sequencing technology [15] is revolutionizing genome research, industry, and markets due to its ability to generate ultra-long DNA fragments, aka long reads, and its portability. Producing long reads [23] is key to improving the quality of de novo assembly, spanning repetitive genomic regions, and identifying large structural variations. Moreover, portable, real-time, USB-flash-drive-sized nanopore sequencers, MinION [15] and SmidgION [24], have demonstrated their power in tracking the genomes of Ebola [12], Zika [6], and COVID-19 [20] viruses during disease outbreaks.

Compared to conventional short-read Illumina sequencing, nanopore sequencing suffers a high error rate [15], e.g., 12%. A nanopore sequencer measures changes in electrical current as organic DNA fragments pass through its pore. Due to the tiny amplitude of the currents triggered by DNA motions, a nanopore sequencer inevitably introduces noise into the raw electrical signals, thus producing sequencing errors. A base-caller translates raw electrical signals into digital DNA symbols, i.e., [A, C, G, T]. To reduce sequencing errors, a sequencing machine generates multiple unique reads [15] that include a given DNA symbol. These reads are base-called individually and then assembled to decide the correct value of each DNA symbol. The number of unique reads containing a given DNA symbol is called the coverage. Typically, the coverage is between 30 and
50 [29, 33, 36]. To further enhance base-calling accuracy, recent works [3, 7, 29, 33, 36] use deep neural networks (DNNs) for base-calling. A DNN-based base-caller, e.g., Guppy [36], Scrappie [29], or Chiron [33], consists of convolutional, recurrent, and fully-connected layers, as well as a Connectionist Temporal Classification (CTC) decoder. Although they achieve high base-calling accuracy, prior DNN-based base-callers are slow. For instance, Guppy, with its high base-calling accuracy, obtains only 1 million base pairs per second (bp/s) on a server-level GPU.
At such a speed, it takes 25 hours for Guppy to base-call a 3G-bp human genome with a 30x coverage. During virus outbreaks, it is challenging even for a data center equipped with powerful GPUs to process base-calling for a large group of presumptive positive patients. As a result, base-calling becomes the most time-consuming step in a nanopore sequencing pipeline [30].

Figure 1: The pipeline of nanopore sequencing.

Recently, both industry [19] and academia [18, 38] proposed network quantization algorithms to power-efficiently accelerate DNN inference without sacrificing inference accuracy by approximating the inputs, weights, and activations of a DNN with fixed-point representations of smaller bit-widths. In this way, computationally expensive floating-point multiply-accumulates (MACs) in a DNN can be replaced by fixed-point operations. Besides conventional CPUs and GPUs, FPGAs and ASICs are adopted to accelerate quantized DNN inference in data centers. Moreover, to further overcome the von Neumann bottleneck in data centers, recent research efforts use various nonvolatile memory (NVM) technologies including ReRAM [31, 40], PCM [1], and STT-MRAM [39] to build processing-in-memory (PIM) accelerators that process quantized DNN inference in memory arrays.

However, it is difficult to apply prior network quantization techniques to base-callers and to accelerate quantized base-callers with state-of-the-art NVM PIM architectures. Naïvely quantizing a base-caller via prior network quantization algorithms substantially increases the number of systematic errors that cannot be corrected by voting operations among multiple reads containing the same DNA symbols.
Furthermore, state-of-the-art PIM accelerators take advantage of analog computing to maximize the inference throughput of quantized DNNs, but their analog computing style depends heavily on a large number of CMOS analog-to-digital converters (ADCs) that significantly increase power consumption and area overhead. For instance, CMOS ADCs cost 58% of the power consumption and 30% of the chip area in a typical PIM design [31]. Finally, state-of-the-art NVM PIM designs cannot process some essential operations of a base-caller, such as CTC decoding and read voting, which usually consume >50% of the total execution time in a quantized base-caller.

In this paper, we propose a novel algorithm- and architecture-co-designed PIM accelerator,
Helix, to efficiently and accurately process quantized nanopore base-calling. Our contributions are summarized as follows:

• Systematic error aware training. We present systematic error aware training (SEAT) to reduce the number of systematic errors that cannot be corrected by read votes in a quantized base-caller. We introduce a new loss function to indirectly minimize the edit distance between a consensus read and its ground truth DNA sequence. SEAT enables 5-bit quantized base-callers to achieve their full-precision base-calling accuracy.
Figure 2: Base-caller comparison.

Figure 3: Random vs. systematic errors.

• An ADC-free PIM accelerator. We propose a Spin Orbit Torque MRAM (SOT-MRAM)-based array architecture to accelerate analog-to-digital conversion operations without CMOS ADCs. We also show that our SOT-MRAM ADC arrays are resilient to process variation. We modify a conventional NVM-based dot-product engine to accelerate CTC decoding operations, and then present a SOT-MRAM-based binary comparator array to process read voting operations in a quantized base-caller.

• Base-calling accuracy and throughput. We implemented all proposed techniques of Helix and compared Helix against state-of-the-art PIM designs that accelerate quantized DNN inference. Experimental results show that, compared to state-of-the-art PIM accelerators, Helix improves base-calling throughput by 28x, throughput per Watt by 80x, and throughput per mm2 by 27x without degrading accuracy.

2 BACKGROUND

As Figure 1 shows, a nanopore sequencing pipeline [30] consisting of base-calling, overlap finding, assembly, read mapping, and polishing is employed to generate a digital assembly. The input of the pipeline is the raw electrical signals produced by nanopore sequencers, e.g., MinION [15] and SmidgION [24]. Base-calling translates the raw signal data into digital DNA symbols, i.e., [A, C, G, T]. Overlap finding computes all suffix-prefix matches between each pair of reads and then generates an overlap graph, where each node denotes a read and each edge indicates the suffix-prefix match between two nodes. The assembly step traverses the overlap graph to construct a draft assembly. Base-called reads are mapped to the generated draft assembly by read mapping. Lastly, the final assembly is polished.

DNN-based base-caller.
Figure 4: The DNN architecture of Guppy: (a) structure; (b) a GRU cell; (c) the CTC base probability matrix; (d) a beam search with width 2.

Figure 5: A dot-product engine.

Figure 6: SOT-MRAM.

DNNs are adopted to filter noise and accurately translate raw electric signals into digital DNA symbols. A DNN-based base-caller typically consists of multiple convolutional (Conv), gated recurrent unit (GRU), and fully-connected (FC) layers. The convolutional layers recognize local patterns in the input signals, whereas the GRU layers integrate these patterns into base-calling probabilities. A CTC decoder is then used to compute digital DNA symbols according to the base probabilities. Compared to the Hidden Markov Model (HMM) [22], a series of DNN-based base-callers, including Metrichor [27], Albacore [26], Flappie [7], Scrappie [29], Guppy [36], and Chiron [33], significantly improve base-calling accuracy, as shown in Figure 2. Among all base-callers, the official Oxford Nanopore Technologies GPU-based base-caller,
Guppy, achieves both the best accuracy and the highest speed. We selected Guppy as our base-caller baseline, and also consider other DNN-based base-callers in §6. Due to their complex DNN structures, base-callers are generally slow [36]. As a result, base-calling consumes 44.5% [30] of the total execution time of a nanopore sequencing pipeline. The details of base-callers are introduced in §5.2.
Base-calling error. We define the number of base-calling errors as the edit distance between a read predicted by a base-caller and its ground truth. The edit distance quantifies how dissimilar two reads are by counting the minimum number of insertions, deletions, and substitutions required to transform one into the other. To enhance base-calling accuracy, a base-caller translates each signal data multiple times and generates multiple reads containing the same signal data. At the end of base-calling, each DNA symbol value is decided by votes among all reads containing its corresponding signal data. As Figure 3 shows, if base-calling errors occur randomly among the reads, the voting result for a DNA symbol can still be correct, since most reads have the correct value. This is a random error. However, if base-calling errors happen in a systematic way, i.e., all copies of a signal are translated to the same wrong value, it is impossible to produce the correct value by read voting. This is a systematic error.
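As an illustrative sketch (toy reads, not data from the paper), the edit distance can be computed with the classic dynamic program, and a per-position majority vote shows why random errors are correctable while systematic ones are not:

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to transform read `a` into read `b` (Levenshtein distance)."""
    dp = list(range(len(b) + 1))          # distances for the empty prefix of `a`
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # delete ca
                        dp[j - 1] + 1,        # insert cb
                        prev + (ca != cb))    # substitute (free if equal)
            prev = cur
    return dp[-1]

def vote(reads):
    """Per-position majority vote across equal-length aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

truth = "AGAC"
random_errors = ["AGAC", "AAAC", "AGAT"]       # errors scattered across reads
systematic_errors = ["AGTC", "AGTC", "AGTC"]   # every read makes the same mistake
```

Voting recovers the truth in the random case because each wrong symbol is outvoted, while in the systematic case every read agrees on the same wrong symbol, so the error survives the vote.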
Convolutional layer. As Figure 4a shows, a base-caller includes multiple convolutional layers to process the raw electric signals. The first convolutional layer receives an L x N floating-point signal vector, where L is the input length and N is the input channel number (e.g., N = 1). It then convolves the input vector with a K x N x M weight filter to generate an output vector for the next activation layer [33], where K is the weight kernel size and M is the output channel number. Each L x N floating-point signal vector is generated by a fixed-size window sliding over the entire signal data array. After a base-calling operation, the sliding window moves forward by T elements [33], where T is the sliding offset, e.g., T =
1. The base-caller then works on a new signal vector. At the end of base-calling, the ⌊L/T⌋ reads containing the same signal element vote for its value.

GRU layer. A base-caller uses a set of GRU layers to integrate the patterns produced by the convolutional layers into base-calling probabilities. As Figure 4b describes, a GRU layer receives an input X_t and its own output H_{t-1} from the last time step. It then uses a reset gate R_t and an update gate Z_t to reset and update the cell state at time step t. The output H_t of a GRU layer is computed as

Z_t = σ(W_z X_t + U_z H_{t-1} + b_z)
R_t = σ(W_r X_t + U_r H_{t-1} + b_r)
H̃_t = tanh(W_h X_t + U_h (R_t ⊗ H_{t-1}) + b_h)
H_t = Z_t ⊗ H_{t-1} + (1 - Z_t) ⊗ H̃_t        (1)

where W_z, U_z, W_r, U_r, W_h, and U_h are the weights for Z_t, R_t, and the hidden state H̃_t, respectively; b_z, b_r, and b_h are their biases; σ is the sigmoid activation; tanh is the hyperbolic tangent activation; and ⊗ denotes element-wise multiplication.

CTC decoder. Since it is difficult for a nanopore sequencer to precisely control DNA motion at a uniform speed, multiple elements in the input signal vector may be generated by a single DNA nucleotide [33]. A base-caller adopts a CTC decoder [10, 11] to map an input signal vector R = [I_0, I_1, ..., I_{L-1}] to a corresponding digital read D = [H_0, H_1, ..., H_{Z-1}], where L ≠ Z and there is no alignment between R and D. More specifically, the convolutional, GRU, and FC layers provide the symbol probabilities p_t(a_t | R) for each time step, where a_t ∈ [A, C, G, T, −] (− indicates a blank). The probabilities p_t(a_t | R) over all time steps form a base probability matrix, as shown in Figure 4c. By looking up the base probability matrix, the CTC decoder can decide the probability of a read. The probability of D is calculated by

p(D | R) = Σ_{A ∈ A_{D,R}} Π_{t=0}^{L-1} p_t(a_t | R)        (2)

where A_{D,R} indicates all valid alignments between D and R.
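Equation 2 can be checked by brute force on a toy example: enumerate every length-L path over [A, C, G, T, −], keep the paths that collapse to the read D (merge adjacent repeats, then drop blanks), and sum their probabilities. The probability values below are assumed for illustration:

```python
from itertools import product

ALPHABET = ["A", "C", "G", "T", "-"]          # "-" is the CTC blank

def collapse(path):
    """CTC collapse: merge adjacent repeated symbols, then drop blanks."""
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return "".join(s for s in out if s != "-")

def p_read(read, probs):
    """Equation 2: sum the probabilities of all alignments collapsing to `read`.
    `probs[t][s]` is p_t(s | R) from the base probability matrix."""
    total = 0.0
    for path in product(ALPHABET, repeat=len(probs)):
        if collapse(path) == read:
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total

# a 2-time-step base probability matrix (illustrative values only)
probs = [{"A": 0.6, "C": 0.0, "G": 0.0, "T": 0.0, "-": 0.4},
         {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.0, "-": 0.5}]
# paths AA, A-, -A all collapse to "A", so p("A") = 0.3 + 0.3 + 0.2 = 0.8
```

This brute-force sum is exponential in L; real decoders use dynamic programming or the beam search described next.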
The CTC decoder infers the most likely read by a beam search on the matrix. As Figure 4d highlights, during a beam search with width 2, the CTC decoder keeps only the symbols with the top-2 largest probabilities at each time step. At t = 0, it keeps A and −; at t = 1, it computes p(AA), p(A−), p(−A), and p(−−) from the kept prefixes. Since AA, A−, and −A all indicate A, they can be merged into A.
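This width-2 search can be sketched as a simplified prefix beam search; the per-step probabilities below are assumed values, and each state is a (collapsed prefix, last symbol) pair so that blank-separated repeats are handled:

```python
def beam_search(probs, width=2):
    """Simplified CTC prefix beam search. `probs[t]` maps symbol -> probability;
    "-" is the blank. Keeps the `width` most probable states per step."""
    beams = {("", "-"): 1.0}                      # (collapsed prefix, last symbol)
    for step in probs:
        nxt = {}
        for (prefix, last), p in beams.items():
            for sym, ps in step.items():
                if sym == "-":
                    key = (prefix, "-")           # blank: prefix unchanged
                elif sym == last:
                    key = (prefix, sym)           # repeat: merged into prefix
                else:
                    key = (prefix + sym, sym)     # new symbol extends prefix
                nxt[key] = nxt.get(key, 0.0) + p * ps
        beams = dict(sorted(nxt.items(), key=lambda kv: -kv[1])[:width])
    merged = {}                                   # merge states sharing a prefix
    for (prefix, _), p in beams.items():
        merged[prefix] = merged.get(prefix, 0.0) + p
    return max(merged, key=merged.get)

# two steps over a reduced {A, -} alphabet (assumed probabilities)
steps = [{"A": 0.6, "-": 0.4}, {"A": 0.5, "-": 0.5}]
```

The final merge step is exactly the AA/A−/−A-to-A merge described above.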
So p(A) = 0.36, and the beam search finds A as the most likely read.

To reduce the computing overhead of DNNs, recent work proposes network quantization [18, 19, 38], which approximates 32-bit floating-point inputs, weights, and activations with fixed-point representations of smaller bit-widths. In this way, quantized networks perform inference with low-cost fixed-point MACs.
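A minimal sketch of such a scheme, symmetric linear quantization with a single scale factor (not the specific FQN algorithm), looks like:

```python
def quantize(xs, bits):
    """Symmetric linear quantization sketch: floats -> `bits`-bit signed ints
    plus one float scale factor."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(x) for x in xs) / qmax) or 1.0   # guard all-zero input
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return q, scale

w = [0.31, -0.12, 0.05, -0.27]        # example float weights (illustrative)
q, s = quantize(w, 8)                 # integer weights in [-128, 127]
w_hat = [qi * s for qi in q]          # dequantized approximation of w
```

The fixed-point MACs then operate on the integers `q`, and the single scale `s` is applied once at the end, which is what makes the arithmetic cheap.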
Various NVM-based dot-product engines (e.g., STT-MRAM [39], PCM [1], ReRAM [31]) are used to improve the performance per Watt of vector-matrix multiplications over conventional CMOS ASIC designs. One example of an NVM-based dot-product engine is shown in Figure 5, where the array consists of word-lines (WLs), bit-lines (BLs), and NVM cells. Each cell on a BL is programmed to a certain resistance (R), e.g., cell_x on BL_0 is written to R_x, where x = 0, 1, 2. The cell conductance (G) is the inverse of the cell resistance, e.g., cell_x has a conductance of G_x = 1/R_x. A voltage V_x can be applied to each WL, so that the current I_x passing through cell_x to the BL is the product of the voltage and the cell conductance (V_x · G_x). Based on Kirchhoff's law, the total current I_0 on BL_0 is the sum of the currents passing through each cell on the BL, i.e., I_0 = Σ_x (V_x · G_x). All BLs in the array produce their current sums simultaneously with the same voltage inputs along the WLs. In this way, in each cycle, a vector-matrix multiplication between the input vector V and the conductance matrix G stored in the array is computed by the dot-product engine. Conversion between analog and digital signals is necessary for dot-product engines to communicate with other digital circuits. A digital-to-analog converter (DAC) converts digital inputs into the corresponding voltages applied to each WL, while an ADC converts the outputs of the dot-product engine, i.e., the accumulated BL currents, into digital values.
Figure 7: The accuracy and speed of quantized Guppy.

Figure 8: The area and power breakdown of NVM dot-product engines.
Spin Orbit Torque MRAM (SOT-MRAM) [13] has emerged as one of the most promising nonvolatile memory alternatives to power-hungry SRAM. To record data, SOT-MRAM uses a heavy metal layer and a perpendicular Magnetic Tunnel Junction (MTJ) consisting of two ferromagnetic layers separated by a thin insulator (MgO), as shown in Figure 6. The reference (fixed) layer has a fixed magnetic direction, while the magnetic direction of the free layer can be switched by an in-plane current flowing through the heavy metal. When the two layers have parallel magnetic directions, the MTJ is in the low resistance state (LRS) and indicates "0". In contrast, if the two layers are anti-parallel, the MTJ is in the high resistance state (HRS) and represents "1". To write a cell, the write word-line (WWL) is first activated. When the write bit-line (WBL) voltage exceeds the source line (SL) voltage by a threshold, "1" is written to the cell. Conversely, if the WBL voltage is below the SL voltage by a threshold, "0" is written. To read a cell, the read word-line (RWL) is activated, a read voltage is applied to the read bit-line (RBL), and the SL is grounded.
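The write and read rules above can be condensed into a small behavioral model; the threshold and resistance values are assumptions for illustration only:

```python
V_TH = 0.2                 # assumed write threshold (V), illustrative
R_LRS, R_HRS = 2e3, 6e3    # assumed parallel / anti-parallel resistances (ohm)

def write(state, wwl_active, v_wbl, v_sl):
    """WWL must be active; the WBL-SL voltage difference selects the value."""
    if not wwl_active:
        return state               # no write without an active WWL
    if v_wbl - v_sl > V_TH:
        return 1                   # anti-parallel -> HRS -> "1"
    if v_sl - v_wbl > V_TH:
        return 0                   # parallel -> LRS -> "0"
    return state                   # below threshold: state unchanged

def read(state, v_rbl=0.3):
    """RWL active, SL grounded: the RBL current reveals the stored state."""
    return v_rbl / (R_HRS if state else R_LRS)   # HRS draws less current
```

A sense amplifier in hardware compares the read current against a reference; here the raw current stands in for that comparison.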
Most emerging NVM technologies, e.g., SOT-MRAM [13], PCM [1], and ReRAM [31], are generally CMOS-compatible, so they can be integrated with each other and with CMOS logic on the same chip. For instance, an MTJ, i.e., the core of a SOT-MRAM cell, has been successfully fabricated with ReRAM cells in a single chip [41]. Furthermore, monolithic 3D stacking technology [28] can also integrate various NVM technologies including ReRAM and STT-MRAM into a 3D vertical memory array to offer complementary tradeoffs among high density, low latency, and long endurance.

Figure 9: Execution time breakdown of Guppy.
It is challenging to accelerate nanopore base-calling from both the algorithm and architecture perspectives. If we naïvely accelerate a base-caller using prior network quantization techniques, the quantized base-caller greatly increases the number of systematic errors that cannot be corrected by read voting. State-of-the-art NVM-based PIMs suffer from the huge power consumption and area overhead of CMOS ADCs when executing a quantized base-caller. Moreover, new bottlenecks, CTC decoding and read voting, emerge in a quantized base-caller, but no prior PIM supports these operations.
We applied the latest network quantization technique, FQN [18], to Guppy to improve its base-calling speed. As Figure 7 shows, the Conv, GRU, FC, and CTC layers of Guppy are quantized with various bit-widths from 4-bit to 32-bit. We executed the quantized Guppy on an NVIDIA Tesla T4 GPU. Although quantizing Guppy with a smaller bit-width, e.g., 4-bit, increases base-calling throughput, the base-calling accuracy of the quantized Guppy after read voting decreases by 4.3%, which dramatically jeopardizes the quality of the final DNA mappings. Base-calling accuracy has two parts: the read accuracy before reads vote, and the vote accuracy after reads vote. The accuracy after reads vote is the more important one, since read voting eliminates all random errors and leaves only systematic errors. Even the 16-bit quantized Guppy suffers from significant systematic errors that cannot be corrected by read voting.

Although prior PIM designs process DNN inference using ReRAM- [9, 31], PCM- [1], and STT-MRAM- [39] based dot-product engines, the power efficiency and scalability of these PIMs are limited by CMOS ADCs. The in-situ analog computing fashion is the key for an NVM-based dot-product engine [1, 9, 31, 39] to substantially improve the computing throughput of vector-matrix multiplications. However, as Figure 8 highlights, CMOS ADCs cost 82%~85% of the power consumption and 87%~91% of the area overhead in ReRAM- [31], PCM- [1], and STT-MRAM- [39] based dot-product engines. Although ReRAM, PCM, and STT-MRAM have cell sizes of 4F², 4F², and 60F², respectively, the power and area of the arrays in the various NVM dot-product engines are similar, since peripheral
circuits including row decoders, column multiplexers, and sense amplifiers dominate the power consumption and area overhead of a dot-product engine. As a result, CMOS ADCs cost 58% of the power consumption and 30% of the chip area in a typical NVM-based PIM design [31]. The power density of recent NVM-based PIMs has already exceeded memory thermal tolerance even with active heat sinks. In particular, a 416 W ReRAM-based PIM [9] has a power density of 842 mW/mm², much larger than the thermal tolerance of a ReRAM chip with active heat sinks [42]. CMOS ADCs seriously limit the scalability and power efficiency of state-of-the-art NVM-based PIM accelerators.

Figure 10: The training of full-precision and quantized base-callers with different loss functions: (a) full-precision model training; (b) 8-bit quantized model training.

Figure 11: Systematic error aware training: (a) baseline training; (b) systematic error aware training.
Besides more systematic errors, new performance bottlenecks emerge in the 16-bit quantized Guppy. As Figure 9 shows, CTC decoding operations consume 16.7% of the base-calling latency, while read voting operations cost 37% of the base-calling latency in the 16-bit quantized Guppy. The Conv, GRU, and FC layers of the quantized Guppy rely heavily on 16-bit fixed-point vector-matrix multiplications that can be executed efficiently by a state-of-the-art GPU. Therefore, we anticipate that these Conv, GRU, and FC layers can be completed by an NVM-based PIM with a shorter latency. In contrast, the CTC decoding and read voting operations of a base-caller are not fully optimized on the GPU. Moreover, no prior PIM design supports CTC decoding or read voting.
To reduce the systematic errors that cannot be corrected by read votes, we propose
Systematic Error Aware Training (SEAT), which aims to minimize the edit distance between a consensus read and its ground truth DNA sequence via a novel loss function during the training of a quantized base-caller.
Baseline training. During the training of a base-caller [7, 36], the gradient is not computed through the edit distance between the predicted DNA sequence and its ground truth, since the computation of edit distance is non-differentiable. As Figure 11a shows, the Conv, GRU, and FC layers generate the base probability matrix from an input signal vector R_i. Instead of edit distances, the CTC decoder [7, 33] computes the probability of the ground truth read G_i, p(G_i | R_i), for the loss function by applying Equation 2 to the base probability matrix. For a training set D, the weights of the base-caller are tuned to minimize

loss0 = Σ_{(G_i, R_i) ∈ D} (− ln p(G_i | R_i))        (3)

where the more similar the predicted read is to G_i, the smaller − ln p(G_i | R_i) is. By making each predicted read more similar to its ground truth, state-of-the-art base-callers indirectly minimize the number of random and systematic errors. However, random errors can be corrected by read voting, whereas only systematic errors are the "real" errors that degrade the quality of the final DNA mappings.

Systematic-error-aware training. The number of systematic errors significantly increases in a quantized base-caller. We created SEAT for the quantized base-caller to minimize the number of systematic errors, as shown in Figure 11b. The base-caller uses multiple input signal vectors, i.e., R_{i-1}, R_i, and R_{i+1}, to generate multiple predicted reads, i.e., O_{i-1}, O_i, and O_{i+1}, that vote to create a consensus read C_i. Instead of directly minimizing the edit distance between C_i and the ground truth read G_i, we build a new loss function that makes C_i more similar to G_i.
For a training set D ,the parameters of the base-caller are tuned to minimize: loss = (cid:213) ( G i , R i )∈ D [− η · ln p ( G i | R i ) + ( ln p ( G i | R i ) − ln p ( C i | R i )) ] (4)where − ln p ( G i | R i ) makes each predicted read more similar to G i ; ( ln p ( G i | R i ) − ln p ( C i | R i )) minimizes the probability differencebetween the consensus read C i voted by multiple predicted readsand G i ; and η ∈ [ , ] is a floating-point constant regulating theimpact of − ln p ( G i | R i ) . The effect of SEAT . As Figure 10(a) shows, we trained a full-precision Guppy by Equation 3 ( loss ) and Equation 4 ( loss ). Ifwe set η in loss to 0, the training cannot converge, since it hasno motivation to improve the accuracy of each read. When we set η to 1, compared to loss , loss slows down training convergence.When the read error rate is high, it is faster to improve the qualityof each read independently. However, two loss functions achieve ref V ref0 V ref1 V ref2 V ref3 R/2 R/2R R R reference voltage generator WW L = V ref0 V ref2 R W L = V ref1 V ref3 S L = WBL =WBL =WBL =WBL =V in Figure 12: The ADC SOT-MRAM array.
Switch prob. (%)
I n p u t v o l t a g e ( V ) 3 V 2 . 9 1 V 2 . 8 2 V 2 . 7 3 V
Figure 13: Input voltage vs.RBL voltage.
Switch prob. (%)
W r i t e v o l t a g e ( V )
similar base-calling accuracy at the end of training. The full-precision Guppy is powerful enough to minimize the number of systematic errors even without read voting. In contrast, the training of the 8-bit quantized Guppy with loss0 and loss1 is shown in Figure 10(b). For the 8-bit quantized Guppy, loss1 increases base-calling accuracy by 6% over loss0 and reaches the same base-calling accuracy as the full-precision model. After the systematic-error-reduction capability of Guppy is damaged by network quantization, loss1 can reduce the systematic errors of the quantized Guppy.

Figure 14: Write voltage vs. pulse duration.

To reduce the area overhead and power consumption of the CMOS ADCs in prior NVM-based PIM accelerators, we propose a SOT-MRAM-based ADC array that reliably processes analog-to-digital conversions.
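Behaviorally, such a conversion reduces to thermometer coding: the input switches every reference cell whose effective write threshold it exceeds, and a small encoder counts the switched cells. The threshold values below are assumed, not measured:

```python
def sot_adc(v_in, thresholds=(0.6, 0.7, 0.8, 0.9)):
    """2-bit ADC sketch. Each column cell sees a different reference voltage
    and therefore a different effective write threshold (assumed values).
    The input switches every cell whose threshold it exceeds, producing a
    thermometer pattern (1000, 1100, 1110, 1111) that encodes 0..3."""
    pattern = [v_in >= t for t in thresholds]
    assert pattern[0], "input below the conversion range of this sketch"
    return sum(pattern) - 1          # encoder: 1000 -> 0, ..., 1111 -> 3
```

The real array performs all four comparisons in one parallel write, so the conversion latency is a single write pulse rather than a sequential sweep.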
ADC array. An example of a 2-bit ADC array is shown in Figure 12. To distinguish 2 bits, the ADC array produces four reference voltages ([V_ref0 − V_ref3] = [3 V, 2.91 V, 2.82 V, 2.73 V]) with an MTJ-based reference voltage generator. In the ADC array, all write word-lines (WWLs) and read word-lines (RWLs) are set to 1, and the source lines (SLs) are set to 0. Input voltages are applied to the write bit-lines (WBLs), and the reference voltages are assigned to the read bit-lines (RBLs). As Figure 13 highlights, due to the spin Hall effect and voltage-controlled magnetic anisotropy [17], the write voltage of SOT-MRAM differs under different RBL voltages: when a larger voltage is applied on the RBL, the SOT-MRAM write voltage reduces significantly. There are thus four cases, i.e., 1000, 1100, 1110, and 1111, when an input voltage writes the four cells in the ADC array. With a small encoder, these four cases are encoded to 0, 1, 2, and 3. In this way, the input voltage is converted to a 2-bit digital value. Although a recent work [4] leverages MTJ stochasticity to build an 8-bit MTJ-based ADC, that design relies on CMOS counters and registers that introduce large power consumption and area overhead.

Resolution and frequency. We need to precisely control the write pulses to enable a higher resolution for the ADC array, and there is a trade-off between the resolution and the frequency of an ADC array. Figure 14 shows the switching probability of a SOT-MRAM cell under different voltages and pulse durations. The shorter the pulse duration is, the higher the frequency at which an ADC array can operate. With a shorter write duration, a higher write voltage is required to reliably switch a cell, so under a fixed maximum input voltage, e.g., 3 V, we can distinguish fewer levels of the input voltage (fewer bits) in Figure 13. For a higher resolution under 3 V, a smaller write voltage is preferred; in that case, we have to use a longer write pulse duration, resulting in a lower ADC frequency. To balance this trade-off, we use a 1.5 ns write pulse to switch a SOT-MRAM cell. In this way, 32 levels of the input voltage, i.e., 5 bits, can be distinguished, and the ADC array can operate at 640 MHz.

Reliability. SOT-MRAM has no endurance issue, since on average a cell tolerates ~10^15 writes [16]. However, process variation can make a SOT-MRAM ADC array generate wrong outputs. The relation between the write current I and the pulse duration t can be approximated as

t = τ₀ · e^{Δ(1 − I / (A · J_c))}        (5)

where A is the cross-sectional area of the MTJ free layer; J_c is the critical current density at zero temperature; Δ is the magnetization stability energy height, which is decided by the MTJ volume; and τ₀ is a fitting constant. Due to process variation, different SOT-MRAM cells have different critical parameters, including the MTJ size, Δ, and the write transistor width, length, and threshold voltage, and thereby require different write pulse durations. We iteratively increase the write transistor size to guarantee that the worst-case cell can be switched in 1.5 ns under process variation. To model the process variation of SOT-MRAM, we adopted the parameters shown in Table 1 from [25]. In each iteration, we conducted 10 billion Monte-Carlo simulations with Cadence Spectre to generate a write duration distribution under a certain SOT-MRAM cell size, which is dominated by the write transistor size.

Figure 15: Write duration with a … F² cell size.

Figure 16: Worst-case write duration with varying cell sizes.

Table 1: Process variation of SOT-MRAM

Parameter                            µ              σ
WR/RD transistor width (W_wt)        384 nm         …
WR/RD transistor length (L_wt)       192 nm         …
Threshold voltage (V_th)             0.… V          …
MTJ resistance-area product (R·A)    25 Ω·µm²       …
MTJ area (A)                         64 nm × … nm   …
Stability energy height (Δ)          22             27%
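The switching-time model of Equation 5 and the Monte-Carlo worst-case search can be sketched as follows; τ₀, J_c, and the MTJ area are placeholder constants (only the nominal Δ = 22 and its 27% σ follow Table 1), so the absolute durations are illustrative:

```python
import math
import random

TAU0 = 1e-9            # fitting constant tau_0 (s), assumed
JC = 5e11              # critical current density at zero temperature (A/m^2), assumed
AREA = 64e-9 * 64e-9   # MTJ free-layer cross-sectional area (m^2), assumed
DELTA = 22             # nominal stability energy height (Table 1)

def write_duration(i_write, delta=DELTA, area=AREA):
    """Equation 5: t = tau_0 * exp(Delta * (1 - I / (A * J_c)))."""
    return TAU0 * math.exp(delta * (1.0 - i_write / (area * JC)))

def worst_case_duration(i_write, trials=10_000, sigma=0.27, seed=7):
    """Monte-Carlo sketch: sample Delta with 27% sigma (Table 1) and keep the
    slowest cell, mimicking the worst-case write-duration search."""
    rng = random.Random(seed)
    return max(write_duration(i_write, delta=DELTA * (1 + rng.gauss(0.0, sigma)))
               for _ in range(trials))
```

A larger write current (i.e., a larger write transistor) shortens the duration, which is the lever the iterative sizing loop in the text uses.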
Finally, Figure 16 shows the relation between the worst-case cell write duration and the cell size. We selected a 60F² cell to tolerate process variation and guarantee that the worst-case cell write duration is 1.5 ns.

Pipelined dot-product engine. SOT-MRAM ADC arrays can be easily integrated with prior NVM-based dot-product engines. As Figure 17 shows, the pipeline of a fixed-point vector-matrix multiplication includes fetching data, MAC, ADC, shift-&-add, and storing the result.

Figure 17: The pipeline of a NVM-based dot-product engine (input register → NVM dot-product array → MRAM ADC arrays → encoders and S&A units → output register).

❶ During the data-fetching stage, 128 1-bit fixed-point inputs are read from input registers; the 2-bit weights are stored in a 128×128 array of a NVM-based dot-product engine. ❷ The NVM-based dot-product engine converts the 1-bit fixed-point inputs to analog voltages through DACs, and performs 1-bit × 2-bit multiply-accumulations in the analog domain. ❸ Multiple ADC arrays digitize the MAC results; the NVM-based dot-product engine generates 128 MAC results simultaneously. ❹ After encoding, the digital values are sent to shift-&-add units to generate the final dot-product results. ❺ Finally, the final dot-product results are written into output registers.
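The shift-&-add stage ❹ reconstructs a multi-bit dot product from the bit-serial passes of stages ❷–❸; a minimal software emulation, with the analog MAC and its ADC readout modeled as an exact integer sum:

```python
def dot_product_bit_serial(inputs, weights, in_bits=5):
    """Emulate the pipeline for one output column: feed one 1-bit input
    plane per pass (stage 2), digitize the per-plane MAC result (stage 3),
    and merge the planes with shift-&-add (stage 4)."""
    acc = 0
    for b in range(in_bits):
        plane = [(x >> b) & 1 for x in inputs]            # 1-bit DAC inputs
        mac = sum(p * w for p, w in zip(plane, weights))  # in-array MAC + ADC
        acc += mac << b                                   # shift-&-add unit
    return acc
```

The result matches a direct fixed-point dot product, which is why feeding inputs one bit plane at a time lets the array hold only low-precision weights.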
Figure 18: CTC decoding in a NVM dot-product engine.
CTC decoding. To process CTC beam searches, we rely on a NVM-based dot-product array. Figure 18 shows how to process a CTC beam search with a width of 2. The top-2 largest probabilities of bases (i.e., A and −) at time step 1 (t = 1) in the CTC base probability matrix are written to the diagonal cells of a NVM-based dot-product array. Since the search width is 2, each probability of a base at t = 1 is multiplied by the probabilities of the top-2 bases (i.e., A and −) at t = 2, so p(AA), p(A−), p(−A), and p(−−) can be concurrently computed. To support merging the probabilities of multiple-base sequences, we propose to add a transistor to each BL to connect it to its neighboring BL. By closing all these transistors, we merge the probabilities of the four 2-base sequences; in this way, we have p(A) = p(AA) + p(A−) + p(−A) + p(−−).

Reliability of NVM dot-product arrays. Since each BL carries only one base's probability, the resistance of the transistor we add to each BL is too small to introduce errors in CTC decoding. Since a NVM dot-product array operates at only 10 MHz [31], the extra transistor does not slow down the dot-product array. However, our design increases the number of writes to a NVM dot-product array. A ReRAM cell endures a bounded number of writes; still, a recent ReRAM-based PIM [9] can reliably run back-propagation for 15.7 years. Compared to back-propagation, the Conv, GRU, and FC layers and the CTC decoder of a base-caller incur far fewer writes. Based on our estimation, the NVM dot-product arrays of Helix can reliably work for >20 years, even when running Chiron, which has the most complex architecture and the largest number of parameters among all base-callers.
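The probability merge above is the hardware analogue of prefix merging in software CTC beam search (following Hannun [11]); a compact sketch, using '-' as the blank and the standard CTC collapsing rule:

```python
from collections import defaultdict

def ctc_beam_search(probs, width):
    """CTC prefix beam search. probs is a list of {symbol: probability}
    dicts, one per time step; '-' is the blank. Paths that collapse to
    the same label sequence are merged by summing their probabilities,
    which is what the connected bit lines do in the array."""
    # prefix -> [p_blank, p_nonblank]: probability mass of paths for this
    # prefix whose last emitted symbol is (or is not) a blank
    beams = {'': [1.0, 0.0]}
    for step in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (pb, pnb) in beams.items():
            for sym, p in step.items():
                if sym == '-':                       # blank: prefix unchanged
                    nxt[prefix][0] += (pb + pnb) * p
                elif prefix and sym == prefix[-1]:   # repeated symbol
                    nxt[prefix][1] += pnb * p        # collapses without a blank
                    nxt[prefix + sym][1] += pb * p   # survives after a blank
                else:
                    nxt[prefix + sym][1] += (pb + pnb) * p
        # keep the `width` most probable merged prefixes
        beams = dict(sorted(nxt.items(), key=lambda kv: -sum(kv[1]))[:width])
    return max(beams.items(), key=lambda kv: sum(kv[1]))
```

For two steps over {A, −}, the winning prefix "A" accumulates p(AA) + p(A−) + p(−A), mirroring the merge performed on the shared bit line.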
Figure 19: Read voting: (a) longest match; (b) align & vote; (c) encoding.

Read vote. After a base-caller generates multiple consecutively predicted reads, a read vote is required to produce a consensus read. A voting example is shown in Figure 19, where there are three reads, i.e., R1 = "ACTA", R2 = "CTAG", and R3 = "GAGAT". A vote finds the longest matches between all reads (Figure 19a), aligns the reads, and computes the consensus (Figure 19b). Finding the longest matches between all reads is the most important operation in a read vote. To find the longest match between R1 and R2, all of their sub-strings have to be compared. As Figure 19(c) describes, we encoded each DNA symbol with 3 bits, which converts the string-match problem into comparing two binary vectors. We propose a SOT-MRAM-based binary comparator array to accelerate binary vector comparisons.

Figure 20: A binary comparator array.

Binary comparator array. We wrote all sub-strings of R1, e.g., "ACTA" and "CTA", into a SOT-MRAM array shown in Figure 20. Each sub-string occupies a row of the array; for instance, "ACTA" is in the first row, while "CTA" is in the second row. We used a 2-cell pair in a row to record each bit in the encoding of a DNA symbol: 0 is represented by a low-resistance-state (LRS) cell followed by a high-resistance-state (HRS) cell, while 1 is indicated by a HRS cell followed by a LRS cell. Therefore, in Figure 20, 6 cells in the first row encode the first "A" of "ACTA", while 6 cells in the second row encode the first "C" of "CTA". We applied the voltages representing a sub-string of R2, e.g., "C", on the RBLs of the binary comparator array. Each bit in the encoding of "C" (010) is represented by two voltages applied on the two RBLs of a 2-cell pair respectively: 0 is represented by a low and then a high voltage, while 1 is denoted by a high and then a low voltage.
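A small software model of this comparator may help: each encoded bit becomes a complementary (conductance, voltage) pair, and a row's source-line current is zero exactly when the stored and applied symbols match. Only the code for "C" (010) is fixed by the text; the other 3-bit codes below are hypothetical.

```python
# Hypothetical 3-bit codes; only 'C' = 010 is given in the text.
CODE = {'A': 0b001, 'C': 0b010, 'G': 0b011, 'T': 0b100, '-': 0b000}

def cells(base):
    """Program one stored symbol: per bit, 0 -> (LRS, HRS), 1 -> (HRS, LRS).
    LRS is modeled as conductance 1 and HRS as conductance 0."""
    out = []
    for i in (2, 1, 0):
        bit = (CODE[base] >> i) & 1
        out += [(1, 0), (0, 1)][bit]
    return out

def volts(base):
    """RBL voltages for a query symbol: 0 -> (low, high), 1 -> (high, low)."""
    out = []
    for i in (2, 1, 0):
        bit = (CODE[base] >> i) & 1
        out += [(0, 1), (1, 0)][bit]
    return out

def sl_current(stored, query):
    """Current summed on the row's source line: zero iff the strings match."""
    g = [c for b in stored for c in cells(b)]
    v = [x for b in query for x in volts(b)]
    return sum(gi * vi for gi, vi in zip(g, v))

def longest_match(r1, r2):
    """Longest common substring found via comparator lookups."""
    for n in range(min(len(r1), len(r2)), 0, -1):
        for i in range(len(r1) - n + 1):
            for j in range(len(r2) - n + 1):
                if sl_current(r1[i:i + n], r2[j:j + n]) == 0:
                    return r1[i:i + n]
    return ''
```

In hardware all rows (sub-strings) are sensed in parallel rather than looped over, but the zero-current match condition is the same.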
If two DNA symbols are the same, no current accumulates on the corresponding SL; the sense amplifier senses a current on the SL only if the two DNA symbols differ. Unlike alignment and assembly, aligning reads during read voting is easy [33], because the order of these reads is already known and each read is only tens of bases long.

Reliability of binary comparator arrays. To compare two 30-base reads, a binary comparator array requires >180 cells on a RWL. We used the 60F² cell size to build 256×256 arrays as binary comparators to study process variation. We also adopted the same process-variation parameters in Table 1. We performed 10 billion Monte-Carlo simulations to profile the error rate with random 30-base read inputs. The error rate for reading a single

Table 2: The area and power of Helix
Component          Params      Spec       Power (mW)   Area (mm²)
eDRAM buffer       bank num    4          20.7         0.083
                   capacity    64KB
Bus                wire num    384        7            0.09
Router             flit size   32         10.5         0.0378
Activation         number      2          0.52         0.0006
S+A                number      1          0.05         0.00006
MaxPool            number      1          0.4          0.0024
OR                 size        3KB        1.68         0.0032
NVM array          size        128×128    2.4          0.0002
                   bits/cell   2
S+H                number      8×128      0.001        0.00004
S+A                number      4          0.2          0.00024
IR                 size        2KB        1.24         0.0021
OR                 size        256B       0.23         0.00077
DAC                resolution  1 bit      4            0.00017
                   number      8×128
ISAAC IMA total    number      12         289          0.157
ISAAC tile total                          330          0.372
ISAAC total        tile num    168
SOT-MRAM ADC       size        32×—       —            —
Helix IMA total    number      12         122          0.0439
Helix tile total                          163          0.259
SOT-MRAM cmp       size        256×256    1.3 W        0.11
binary cmp         number      1024
Helix total        tile num    168

cell is low. After comparing 556 million 30-base reads, on average, our binary comparator array makes only about one mistake. We believe this error rate is acceptable for Helix, since assembly, read mapping, and polishing in the nanopore sequencing pipeline may correct such errors. For the algorithm modification, our systematic error aware training increased the training time of quantized base-callers by at least 32%.

Table 3: The architecture of various base-callers
Layer     Scrappie    Chiron    Guppy
Input     300 × —     — × —     — × —
Conv      — × —       — × —     — × —
RNN       — × —       — × —     — × —
FC        — × —       — × —     — × —
Our 256×256 SOT-MRAM binary comparator arrays together cost only 1.3 W of power and occupy 0.11 mm². We adopted a NVM dot-product engine simulator from [40] and modified it to cycle-accurately study the performance, power, and energy consumption of Helix and our baseline NVM-based PIM accelerator. According to a user-defined accelerator configuration and a DNN topology description, the simulator generates the performance and power details of the accelerator inferring the DNN. We integrated the ADC arrays and binary comparator arrays of Helix into the pipeline and data flow of the simulator. We implemented our systematic error aware training in base-callers [29, 33, 36], which are trained on either an NVIDIA Tesla T4 GPU or an Intel Xeon E5-4655 v4 CPU.
Base-callers. Oxford Nanopore Technologies has updated its pore type to R9.4. Among all base-callers, only Metrichor [27], Albacore [26], Flappie [7], Scrappie [29], Guppy [36], and Chiron [33] can base-call R9.4 reads. Metrichor is a cloud-based base-caller whose details are unknown, while Albacore has been deprecated by Oxford Nanopore Technologies and replaced by its GPU-version successor Guppy and CPU-version successor Flappie; Guppy and Flappie share the same DNN topology. In this paper, we include three base-callers: Guppy, Scrappie, and Chiron. Guppy and Chiron are GPU-based base-callers, while Scrappie can be executed only on a CPU. We redesigned Scrappie using TensorFlow, so that it can also be processed by a GPU. The base-caller architectures are shown in Table 3. All base-callers share a similar network architecture comprising convolutional, recurrent neural network (RNN), and fully-connected layers; the RNN can be a GRU or a Long Short-Term Memory (LSTM) layer. Chiron has the most complex DNN topology: its convolutional layers have the largest number of weights, and its RNN is an LSTM layer, which has more recurrent gates. We assume the beam search width of the CTC decoder in each base-caller is 10.
Table 4: The dataset for various base-callers.
Datasets. We used R9.4 training datasets [32], including E. coli, Phage Lambda, M. tuberculosis, and human, to train the base-callers. The input signal is normalized by subtracting the mean of the entire read and dividing by the standard deviation. At the beginning of each training epoch, the dataset is shuffled and then fed into the base-caller in batches. Training with this mixed dataset enables each base-caller to achieve better generality and base-calling accuracy. The datasets used to evaluate the base-callers are summarized in Table 4.
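The per-read normalization just described is a z-score over the whole read; a minimal sketch (using the population standard deviation, an assumption):

```python
import statistics

def normalize_read(signal):
    """Per-read normalization before base-calling: subtract the mean of
    the entire read and divide by its (population) standard deviation."""
    mu = statistics.fmean(signal)
    sd = statistics.pstdev(signal)
    return [(x - mu) / sd for x in signal]
```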
Table 5: The comparison between CPU, GPU and Helix.
Parameter   CPU           GPU           Helix
Area        — mm²         — mm²         — mm²
TDP         135 W         70 W          25.7 W
Cache       30 MB L3      6 MB L2       —
Memory      32 GB DDR4    16 GB GDDR6   32 GB NVDIMM
We compared our Helix PIM against the state-of-the-art CPU, GPU, and NVM PIM baselines, summarized as:
• CPU. Our CPU baseline is a 3.2 GHz Intel Xeon E5-4655 v4 CPU, which has 8 cores and a 30 MB last-level cache. More details can be found in Table 5.
• GPU. We selected an NVIDIA Tesla T4 GPU as our GPU baseline, since it supports INT8 and INT4 MAC operations. A 1.5 GHz NVIDIA Tesla T4 GPU has 2560 CUDA cores and a 16 GB GDDR6 main memory.
• ISAAC. We chose ISAAC [31] as our NVM PIM baseline. We assumed ISAAC has the same processing throughput for CTC decoding and read voting without introducing extra power consumption or area overhead. In studying the sensitivity to ADC resolution, we also compared Helix against two successors of ISAAC, IMP [9] and SRE [40].
• 16-bit. We quantized base-callers with 16-bit weights and without systematic error aware training (SEAT), which causes no obvious accuracy degradation. The quantized base-callers run on ISAAC.
• SEAT. We quantized base-callers with 5-bit weights and SEAT to guarantee no accuracy loss. The quantized base-callers run on ISAAC.
• ADC. We replaced the CMOS ADCs of SEAT with our proposed ADC arrays.
• CTC. We used NVM-based dot-product arrays to process CTC decoding operations, on top of ADC.
• Helix. We used SOT-MRAM-based binary comparator arrays to accelerate read votes, on top of CTC. All techniques proposed in this paper are combined in this scheme.
Figure 21: SEAT on Guppy.
Figure 22: Quantization with SEAT.
SEAT & quantization. Though naïvely applying the quantization scheme FQN [18] to base-callers improves base-calling throughput, the number of systematic errors that cannot be corrected by read votes greatly increases. After we trained Guppy with our systematic error aware training (SEAT), we reduced the number of systematic errors. As Figure 21 shows, SEAT leaves the quantized Guppy with no accuracy loss, by penalizing systematic errors in its loss function, if it is quantized with ≥5 bits; with fewer bits, Guppy starts to suffer from a significant number of systematic errors. In this way, SEAT enables more aggressive quantization with smaller bit-widths. We show the base-calling accuracy of various quantized base-callers in Figure 22. We find that with 5-bit quantization, no base-caller suffers from accuracy degradation. However, with smaller bit-widths, e.g., 4-bit, Scrappie and Guppy suffer obvious accuracy degradation, since they have compact architectures and fewer parameters. The parameter-rich Chiron does not lose base-calling accuracy even when quantized with 3 bits.
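For intuition, a generic symmetric uniform quantizer (in the spirit of FQN [18], not its exact recipe) shows how the weight-reconstruction error grows as the bit-width shrinks:

```python
def quantize(weights, bits):
    """Symmetric uniform quantization to signed `bits`-bit fixed point.
    Returns the de-quantized weights and the integer codes."""
    qmax = (1 << (bits - 1)) - 1
    scale = max(abs(w) for w in weights) / qmax
    codes = [round(w / scale) for w in weights]
    return [c * scale for c in codes], codes

def max_error(a, b):
    """Worst-case reconstruction error between two weight vectors."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A 5-bit code keeps the reconstruction error small enough that read votes can still cancel the remaining (random) errors, while 3–4 bits leave larger, correlated errors.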
Figure 23: The comparison of base-callers with SEAT.
Figure 24: The performance, power, and area comparison between various accelerators: (a) performance; (b) performance per Watt; (c) performance per mm².

Quality of final genome mappings. We fed base-called DNA reads generated by quantized base-callers with SEAT into the nanopore sequencing pipeline to evaluate the quality of the final DNA mappings. The accuracy comparison of DNA mappings generated by the full-precision, 4-bit, and 5-bit quantized base-callers with SEAT is shown in Figure 23, where "base-call" indicates the accuracy of reads generated by base-callers; "draft" represents the accuracy of the alignment produced by read mapping; and "polished" means the accuracy of the final read mapping after the polishing step. Compared to full-precision base-callers, the reads, draft alignments, and final mappings generated by 5-bit quantized base-callers with SEAT suffer no accuracy loss. However, if we quantize the base-callers with 4 bits, the accuracy of base-called reads, their alignments, and the final mappings degrades significantly even with SEAT. In particular, the 4-bit quantized
Scrappie reduces the accuracy of the final mapping by 6%. Low-quality genome mappings substantially increase the probability of misdiagnosis and false-negative testing. Therefore, we use 5 bits to quantize these base-callers with SEAT.
Performance, power and area. The performance, power, and area comparison between our CPU, GPU, and NVM-based PIM baselines is shown in Figure 24. Besides the CPU and GPU, we ran the DNN part of full-precision base-callers with 32-bit weights on our PIM baseline ISAAC, but left the other parts of base-calling, including CTC decoding and aligning, on the GPU, without charging ISAAC extra power consumption or area overhead. As Figure 24(a) shows, on average, ISAAC greatly improves base-calling throughput by 25× and more than 2× over the CPU and GPU, respectively. Among all base-callers, Chiron achieves the largest speedup by running its DNN part on ISAAC, since 95% of its base-calling time is consumed by the DNN part: ISAAC improves the base-calling throughput of Chiron by more than 7× over the GPU. ISAAC also increases base-calling throughput per Watt and per mm² by 127% and 25× over the GPU, respectively, as shown in Figures 24(b) and 24(c). If we quantize base-callers with 16 bits, 16-bit improves base-calling speed by 6.25% over ISAAC. In contrast, if we use SEAT to aggressively quantize base-callers with 5 bits, SEAT improves base-calling speed by 11.1% over ISAAC without accuracy loss. Although the throughput improvement achieved by SEAT is not dramatic, SEAT is the key to enabling our power-efficient SOT-MRAM ADC arrays with lower resolution.
Performance per Watt and per mm². Because of SEAT, base-callers can be quantized with 5 bits without accuracy loss. In this way, we can use our SOT-MRAM-based ADC arrays with lower resolution to reduce the power consumption and area overhead of our PIM accelerator.

Figure 25: The comparison against various CMOS ADCs: (a) performance/Watt; (b) performance/mm².

After we replace the CMOS ADCs in SEAT with our SOT-MRAM-based ADC arrays (ADC), the PIM accelerator running 5-bit quantized base-callers still achieves the same performance as SEAT, as shown in Figure 24(a). However, ADC significantly reduces the power consumption and area overhead of the PIM accelerator. As Figure 24(b) shows, on average, ADC improves base-calling throughput per Watt by 127% over SEAT. Moreover, ADC increases base-calling throughput per mm² by 42.9%, as shown in Figure 24(c).

Comparison against ADCs with lower resolution. Recent works rely on CMOS ADCs with lower resolutions, e.g., 5-bit [9] and 6-bit [40], to reduce the power consumption and area overhead of NVM-based dot-product engines. The lower the resolution of a CMOS ADC, the smaller its power consumption and area overhead. Figure 25 compares the performance per Watt and per mm² of NVM-based dot-product engines using our ADC arrays against those using low-resolution CMOS ADCs. As Figure 25(a) shows, on average, our ADC arrays improve base-calling throughput per Watt by 27.9% and 37.3% over 5-bit and 6-bit CMOS ADCs, respectively. Furthermore, on average, our ADC arrays increase base-calling throughput per mm² by 21.8% and 21.3% over 5-bit and 6-bit CMOS ADCs, respectively, as shown in Figure 25(b); the gains are similar because a 5-bit CMOS ADC has area overhead similar to that of a 6-bit CMOS ADC.

CTC decoding. After we process CTC decoding operations in NVM-based dot-product engines, as Figure 24(a) shows, on average,
CTC improves base-calling throughput by 67.8% over ADC. In particular, CTC boosts the base-calling throughput of Chiron by more than 2×. Moreover, CTC also reduces the data transfers between the GPU and our PIM accelerator. In CTC, CTC decoding operations and DNN inferences share the same NVM-based dot-product engines, so CTC does not increase power consumption or area overhead. As a result, CTC improves base-calling throughput per Watt and per mm² by 64% and 69% over ADC, respectively, as shown in Figures 24(b) and 24(c).
Figure 26: The comparison with varying beam search widths (ADC, CTC-1, CTC-10, CTC-100): (a) performance/Watt; (b) performance/mm².

Sensitivity to beam search width. Figure 26 shows the sensitivity of the base-calling throughput of CTC to the beam search width. With a wider beam search in the CTC decoder, CTC achieves larger improvements in base-calling throughput per Watt and per mm². This is because, with a larger beam search width, the execution time of CTC decoding operations becomes increasingly significant: a NVM-based dot-product engine requires more iterations to process a CTC decoding operation with a larger beam search width.

Read voting. By enabling SOT-MRAM-based binary comparator arrays to process read votes, we have all the proposed techniques for
Helix. On average, Helix improves base-calling throughput by more than 2× over CTC, as shown in Figure 24(a). Helix can concurrently compare up to 256 reads with only one binary comparator array during each read vote, without introducing significant power consumption or area overhead. As Figures 24(b) and 24(c) show, Helix boosts base-calling throughput per Watt and per mm² by more than 3× over CTC. Overall, on average, Helix achieves 6× the base-calling throughput of ISAAC.

Nanopore sequencing. Nanopore sequencing [15] has emerged as one of the most promising genome sequencing technologies for enabling personalized medicine, global food security, and virus surveillance, because of its ability to generate long reads and its real-time mobility. In a nanopore sequencing pipeline, the base-calling step costs 44.5% of the total execution time, because of the high computing overhead of state-of-the-art DNN-based base-callers. It takes more than one day for a server-level GPU to base-call a 3G-bp human genome with 30× coverage using a DNN-based base-caller. This is unacceptably slow, particularly during virus outbreaks.

Network quantization. Although prior works propose network quantization [18, 19, 38] to approximate floating-point network parameters with fixed-point representations of lower bit-widths, naïvely applying prior network quantization to base-callers greatly increases the number of systematic errors that cannot be corrected by read votes, thereby substantially degrading the quality of final genome mappings.
NVM dot-product engines. Although ReRAM- [9, 31, 40], PCM- [1], and STT-MRAM-based [39] dot-product engines have been proposed to accelerate DNN inferences, their power efficiency and scalability are limited by power-hungry CMOS ADCs. CMOS ADCs cost 58% of the power consumption and 30% of the chip area in a well-known ReRAM-based PIM [31]. Another recent ReRAM-based PIM [9] consumes 416 W and has a power density of 842 mW/mm², much larger than the thermal tolerance of a ReRAM chip with active heat sinks [42].

Hardware acceleration for genome sequencing. Hardware-specialized acceleration is an effective way to overcome the big genomic data problem. However, most prior works focus only on accelerating genome alignment and assembly [34], particularly short-read alignment [8, 14, 21, 35, 37, 43], whereas long-read alignment and assembly are not the most time-consuming steps in a nanopore sequencing pipeline.
In this paper, we proposed an algorithm/architecture co-designed PIM accelerator, Helix, to process nanopore base-calling. We presented systematic error aware training to decrease the bit-width of a quantized base-caller without increasing the number of systematic errors that cannot be corrected through read voting operations. We also created a SOT-MRAM ADC array to accelerate analog-to-digital conversion operations. Finally, we revised a traditional NVM-based dot-product engine to accelerate CTC decoding operations, and introduced a SOT-MRAM binary comparator array to process read voting operations at the end of base-calling. Compared to state-of-the-art PIM accelerators, Helix improves base-calling throughput by 6×, throughput per Watt by more than 11×, and throughput per mm² by more than 7× without degrading base-calling accuracy.

ACKNOWLEDGMENTS
We thank the anonymous reviewers for their insightful commentsand constructive suggestions. This work was supported in part byNSF CCF-1909509 and CCF-1908992.
REFERENCES [1] S. Ambrogio, M. Gallot, et al. 2019. Reducing the Impact of Phase-Change MemoryConductance Drift on the Inference of large-scale Hardware Neural Networks.In
IEEE International Electron Devices Meeting . 6.1.1–6.1.4.[2] Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Geoffrey Ndu, MartinFoltin, R Stanley Williams, Paolo Faraboschi, Wen-mei W Hwu, John Paul Stra-chan, Kaushik Roy, et al. 2019. PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference. In
ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 715–731. [3] Vladimír Boža, Broňa Brejová, and Tomáš Vinař. 2017. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads.
PloS one
12, 6 (2017),e0178751.[4] I. Chakraborty, A. Agrawal, and K. Roy. 2018. Design of a Low-Voltage Analog-to-Digital Converter Using Voltage-Controlled Stochastic Switching of Low BarrierNanomagnets.
IEEE Magnetics Letters
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
31, 7 (2012), 994–1007.[6] Nuno Rodrigues Faria, Ester C Sabino, Marcio RT Nunes, Luiz Carlos Junior Alcan-tara, Nicholas J Loman, and Oliver G Pybus. 2016. Mobile real-time surveillanceof Zika virus in Brazil.
Genome medicine
8, 1 (2016), 97.[7] Flappie. 2019. Oxford Nanopore Technologies. https://github.com/nanoporetech/flappie[8] Daichi Fuijiki, Arun Subramaniyan, Tianjun Zhang, Yu Zheng, Reetuparna Das,David Blaauw, and Satish Narayanasamy. 2018. GenAx: A Genome SequencingAccelerator. In
IEEE/ACM International Symposium on Computer Architecture .[9] Daichi Fujiki, Scott Mahlke, and Reetuparna Das. 2018. In-Memory Data ParallelProcessor. In
IEEE/ACM International Conference on Architectural Support forProgramming Languages and Operating Systems . 1–14.[10] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber.2006. Connectionist temporal classification: labelling unsegmented sequencedata with recurrent neural networks. In
ACM International Conference on MachineLearning . 369–376.[11] Awni Hannun. 2017. Sequence Modeling with CTC.
Distill (2017). https://doi.org/10.23915/distill.00008 https://distill.pub/2017/ctc. [12] Thomas Hoenen, Allison Groseth, Kyle Rosenke, Robert J Fischer, Andreas Hoenen, Seth D Judson, Cynthia Martellaro, Darryl Falzarano, Andrea Marzi, and R Burke Squires. 2016. Nanopore sequencing as a rapidly deployable Ebola outbreak tool.
Emerging infectious diseases
22, 2 (2016), 331. [13] H. Honjo, T. V. A. Nguyen, et al. 2019. First demonstration of field-free SOT-MRAM with 0.35 ns write speed and 70 thermal stability under 400°C thermal tolerance by canted SOT structure and its advanced patterning/SOT channel technology. In IEEE International Electron Devices Meeting. 28.5.1–28.5.4. [14] Wenqin Huangfu, Xueqi Li, Shuangchen Li, Xing Hu, Peng Gu, and Yuan Xie. 2019. MEDAL: Scalable DIMM Based Near Data Processing Accelerator for DNA Seeding Algorithm. In
IEEE/ACM International Symposium on Microarchitecture. 587–599. [15] Miten Jain, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, and Ian T Fiddes. 2018. Nanopore sequencing and assembly of a human genome with ultra-long reads.
Nature biotechnology
36, 4 (2018), 338.[16] Jimmy J Kan, Chando Park, Chi Ching, Jaesoo Ahn, Yuan Xie, Mahendra Pakala,and Seung H Kang. 2017. A study on practically unlimited endurance of STT-MRAM.
IEEE Transactions on Electron Devices
64, 9 (2017), 3639–3646.[17] H. Lee, F. Ebrahimi, P. K. Amiri, and K. L. Wang. 2016. Low-Power, High-DensitySpintronic Programmable Logic With Voltage-Gated Spin Hall Effect in MagneticTunnel Junctions.
IEEE Magnetics Letters (2016).[18] Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. 2019.Fully Quantized Network for Object Detection. In
IEEE Conference on ComputerVision and Pattern Recognition . 2810–2819.[19] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. 2016. Fixed pointquantization of deep convolutional networks. In
International Conference onMachine Learning .[20] Roujian Lu, Xiang Zhao, Juan Li, Peihua Niu, Bo Yang, Honglong Wu, WenlingWang, Hao Song, Baoying Huang, Na Zhu, et al. 2020. Genomic characterisationand epidemiology of 2019 novel coronavirus: implications for virus origins andreceptor binding.
The Lancet
IEEE/ACMInternational Symposium on Computer Architecture .[22] Metrichor. 2017. Oxford Nanopore Technologies. https://metrichor.com[23] Kazuma Nakano, Akino Shiroma, Makiko Shimoji, Hinako Tamotsu, NorikoAshimine, Shun Ohki, Misuzu Shinzato, Maiko Minami, Tetsuhiro Nakanishi, andKuniko Teruya. 2017. Advantages of genome sequencing by long-read sequencerusing SMRT technology in medical area.
Human cell
30, 3 (2017), 149–161. [24] Nanopore. 2020. SmidgION Nanopore Sequencer. https://nanoporetech.com/products/smidgion. [25] Janusz J Nowak, Ray P Robertazzi, Jonathan Z Sun, Guohan Hu, Jeong-Heon Park, JungHyuk Lee, Anthony J Annunziata, Gen P Lauer, Raman Kothandaraman, Eugene J O'Sullivan, et al. 2016. Dependence of voltage and size on write error rates in spin-transfer torque magnetic random-access memory.
IEEE MagneticsLetters
Proc. IEEE
Briefings in Bioinformatics (04 2018). [31] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In
ACM/IEEEInternational Symposium on Computer Architecture . 14–26.[32] Haotian Teng. 2018. Chiron: A basecaller for Oxford Nanopore Technologies’sequencers. https://github.com/haotianteng/Chiron.[33] Haotian Teng, Minh Duc Cao, Michael B Hall, Tania Duarte, Sheng Wang, andLachlan JM Coin. 2018. Chiron: Translating nanopore raw signal directly intonucleotide sequence using deep learning.
GigaScience
7, 5 (2018).[34] Yatish Turakhia, Gill Bejerano, and William J. Dally. 2018. Darwin: A GenomicsCo-processor Provides Up to 15,000X Acceleration on Long Read Assembly. In
ACM International Conference on Architectural Support for Programming Lan-guages and Operating Systems .[35] Y. Turakhia, S. D. Goenka, G. Bejerano, and W. J. Dally. 2019. Darwin-WGA: ACo-processor Provides Increased Sensitivity in Whole Genome Alignments withHigh Speedup. In
IEEE International Symposium on High Performance ComputerArchitecture . 359–372.[36] Ryan R. Wick, Louise M. Judd, and Kathryn E. Holt. 2019. Performance of neuralnetwork basecalling tools for Oxford Nanopore sequencing.
Genome Biology
IEEE International Symposiumon High Performance Computer Architecture . 277–290.[38] Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang,and Hongbin Zha. 2018. Alternating Multi-bit Quantization for Recurrent NeuralNetworks. In
International Conference on Learning Representations .[39] Hao Yan, Hebin R. Cherian, Ethan C. Ahn, and Lide Duan. 2018. CELIA: A Deviceand Architecture Co-Design Framework for STT-MRAM-Based Deep LearningAcceleration. In
ACM International Conference on Supercomputing . 149âĂŞ159.[40] Tzu-Hsien Yang, Hsiang-Yun Cheng, Chia-Lin Yang, I-Ching Tseng, Han-WenHu, Hung-Sheng Chang, and Hsiang-Pang Li. 2019. Sparse ReRAM Engine: JointExploration of Activation and Weight Sparsity in Compressed Neural Networks.In
ACM/IEEE International Symposium on Computer Architecture. 236–249. [41] Yu Zhang, Xiaoyang Lin, Jean-Paul Adam, Guillaume Agnus, Wang Kang, Wenlong Cai, Jean-Rene Coudevylle, Nathalie Isac, Jianlei Yang, Huaiwen Yang, et al. 2018. Heterogeneous memristive devices enabled by magnetic tunnel junction nanopillars surrounded by resistive silicon switches.
Advanced Electronic Materi-als
4, 3 (2018), 1700461.[42] Yuxiong Zhu, Borui Wang, Dong Li, and Jishen Zhao. 2016. Integrated ThermalAnalysis for Processing In Die-Stacking Memory. In
IEEE International Symposiumon Memory Systems . 402–414.[43] F. Zokaee, M. Zhang, and L. Jiang. 2019. FindeR: Accelerating FM-Index-BasedExact Pattern Matching in Genomic Sequences through ReRAM Technology. In