A Case for Superconducting Accelerators
Swamit S. Tannu, Poulami Das, Michael L. Lewis, Robert Krick, Douglas M. Carmean, Moinuddin K. Qureshi
Swamit S. Tannu
Georgia Institute of Technology
Poulami Das
Georgia Institute of Technology
Michael L. Lewis
Microsoft
Robert Krick
Microsoft
Douglas M. Carmean
Microsoft
Moinuddin K. Qureshi
Georgia Institute of Technology
ABSTRACT
As the scaling of conventional CMOS-based technologies slows down, there is growing interest in alternative technologies that can improve performance or energy-efficiency. Superconducting circuits based on Josephson Junctions (JJs) are an emerging technology that can provide devices which switch with pico-second latencies and consume two orders of magnitude lower switching energy compared to CMOS. While JJ-based circuits can provide high operating frequency and energy-efficiency, this technology faces three critical challenges: limited device density and lack of area-efficient technology for memory structures, reduced gate fanout compared to CMOS, and new failure modes of Flux-Traps that occur due to the operating environment.

The lack of dense memory technology restricts the use of superconducting technology in the near term to application domains that have high compute intensity but require a negligible amount of memory. In this paper, we study the use of superconducting technology to build an accelerator for SHA-256 engines commonly used in Bitcoin mining applications. We show that merely porting an existing CMOS-based accelerator to superconducting technology provides a 10.6X improvement in energy efficiency. Redesigning the accelerator to suit the unique constraints of superconducting technology (such as low fanout) improves the energy efficiency to 12.2X. We also investigate solutions to make the accelerator tolerant of new fault modes and show how this fault-tolerant design can be leveraged to reduce the operating current, thereby increasing the overall energy-efficiency to 46X compared to CMOS. Our paper also develops a workflow for evaluating area, performance, and power for accelerators built in superconducting technology, and this workflow can help other researchers explore designs using this technology.
The slowdown in Moore's law limits the energy-efficiency and performance that can be obtained with general purpose computers. To bridge the gap between available performance and application demand, system designs are increasingly moving towards building application-specific accelerators [1, 2]. While accelerators provide significant performance and energy-efficiency gains, the continued performance growth offered by accelerators is also affected by technology scaling. Unfortunately, the marginal improvements in CMOS device density and performance force us to investigate alternative technologies that can provide improved performance and energy-efficiency. Superconducting technology is one such potential candidate. However, the technology has several constraints and is not yet mature enough to support complex designs. This paper presents a case for accelerators based on emerging superconducting technology.
What is the Technology?
Superconductivity is a physical phenomenon observed in certain metals that exhibit zero electric resistance at extremely low temperatures. It can be leveraged to build energy efficient and high-performance switching devices known as Josephson Junctions (JJ). JJ devices are used as the basic building blocks for constructing logic and memory circuits. JJ technology can operate at frequencies up to 20 GHz, due to the minimal switching delay (2 pico-seconds) of JJs and lossless wires. Moreover, the switching energy of a JJ is about five orders of magnitude smaller than CMOS. However, to achieve superconductivity, JJ devices need to be operated at temperatures of a few Kelvin (typically 4K). To maintain such a temperature, a cryogenic cooler is used, and such coolers typically consume 300W of power for every 1W dissipated at 4K. Although the cooling overhead seems significant, the low switching energy of JJs can still enable devices that have 100x lower energy consumption than CMOS even after accounting for the cooling energy overhead [3]. Thus, JJ technology can provide significant improvements in operating frequency and energy-efficiency compared to CMOS.

Figure 1: Comparing Josephson Junction (JJ) with 16-nm CMOS, based on parameters derived from [4–6]

What are the Challenges?
Building a JJ based computing system is challenging. The primary challenge is the limited logic and memory density of JJ technology. For existing process technology with 248nm feature size, JJ-device density lags by 1000x as compared to CMOS [7, 8]. Although JJ density is projected to grow in coming years [6], near-term JJ technology may not be able to close the 1000x density gap between CMOS and JJ devices. The second challenge is that JJ based logic requires more devices per gate as compared to CMOS. The higher device complexity results from the limited output driving capacity of JJs. For example, standard CMOS gates have a fan-out of four, whereas JJ based logic gates can drive at most one output without requiring extra output drivers. These output drivers are known as Josephson transmission lines (JTLs), and each JTL costs two JJ devices, exacerbating the density problem. The limited fan-out results in significantly different design trade-offs for accelerators built in superconducting technology compared to conventional CMOS. The third challenge is reliability: JJ devices are susceptible to magnetic flux-trapping and manufacturing defects. These defects can result in intermittent faults. In this paper, we study near-term JJ technology for building accelerators and make the following contributions.
Contribution-1: Study of Superconducting Accelerator:
Given the lack of dense memory technology, accelerators built with JJ based technology are likely to be restricted to applications with tiny working set size and high computational intensity, to obtain significant performance and energy efficiency improvements over baseline CMOS accelerators. In this paper, we focus on a SHA-256 accelerator used for blockchain applications or bitcoin mining. We choose a SHA accelerator due to its simple yet rich design that enables the unique design trade-offs offered by JJ technology. A bitcoin mining application requires repeated computation of the double SHA-256 hash for an input message and a 32-bit random key known as a nonce. A bitcoin miner repeats the SHA computation until it finds a key that produces a hash with a set number of leading zeros. This repeated evaluation with the same input message but with different keys fits well with the JJ constraints, as the compute intensity is exceptionally high with a tiny memory footprint. Also, blockchain applications have a concrete figure-of-merit for both performance (Giga-hashes per Second, or GH/s) and energy-efficiency (Giga-hashes per Joule, or GH/J). Furthermore, existing CMOS bitcoin ASICs serve as a highly optimized baseline, facilitating a technology to technology comparison to evaluate the system-level benefits of JJ based fixed function accelerators. We use the Goldstrike 1 [9] bitcoin miner as the baseline CMOS design (see Footnote 1).

Footnote 1: Bitcoin mining is a competitive industry and the state-of-the-art industrial designs of bitcoin mining accelerators are often kept proprietary by the companies. We use Goldstrike 1 in our evaluations because the implementation is publicly available and we can make a fair comparison of CMOS versus JJ technology. Given that the design details for state-of-the-art bitcoin mining accelerators, such as Antminer S-9 (16nm ASIC [10]), are not publicly available, we are unable to evaluate such designs in JJ technology. Nonetheless, we do compare the energy-efficiency of our proposal with the publicly reported energy-efficiency number of Antminer S-9 in Section 5.6 and observe a 15x improvement (including the cooling overheads).

Contribution-2: Technology-Aware Design:
JJ-based adders have significantly different area and performance trade-offs compared to CMOS designs. For example, by accounting for the fan-out problem, we choose an adder structure that minimizes the overall JTL count by fusing consecutive additions. Similarly, the baseline design requires per-stage registers to store the temporary values and relies on wide buses, both of which incur significant overheads. To reduce both JTL and memory overhead, we leverage the predictable production and consumption of intermediate variables. Instead of storing the intermediate variables in register files, we use extremely resource and energy efficient delay-lines to synchronize the producer and consumer stages. Redesigning the accelerator to JJ-technology specific constraints improves the performance by 1.8x and increases the energy-efficiency from 10.6x to 12.4x compared to the CMOS implementation.
Contribution-3: Fault Mitigation and BTWC:
Superconducting technology has a significantly different fault mode. The source of the fault lies in the operating conditions and a fundamental property of superconductivity known as flux trapping. It results in correlated and large granularity faults. Furthermore, these faults have relatively long lifetimes, i.e., they manifest for longer than transients but they are not permanent. To the best of our knowledge, this is the first paper to mitigate faults in JJ technology by leveraging architecture-level solutions, such as redundancy and sparing. To improve the reliability of the proposed accelerator, we design a fault-tolerant SHA-256 engine by provisioning one additional (spare) pipeline stage and a bypassing mechanism that can detect and protect the accelerator against large granularity faults. Our fault-tolerant design incurs minor storage overhead; however, it can be leveraged to improve energy-efficiency. For example, in superconducting circuits power is a product of critical current ($I_c$) and operating frequency. The critical current is essential for the correct operation of a circuit. A minimum value of $I_c$ ensures a noise margin, dictating the tolerable error rate in the logic and memory circuits. This trade-off between $I_c$ and error rate can be leveraged to operate the accelerator at a Better-Than-Worst-Case (BTWC) operating point by leveraging the fault tolerance circuitry. Such a BTWC design can reduce $I_c$ from 38 to 10 micro-amperes, improving the overall energy-efficiency of the superconducting accelerator to 46x compared to the CMOS implementation.

Contribution-4: Methodology for Estimating Area, Performance and Power of Superconducting Accelerators:
Estimation of performance, power, and area for superconducting logic is difficult due to the lack of automated design tools. Furthermore, standard cells and design rules in superconducting logic families are fundamentally different from CMOS technology. For example, logic cells in JJ technology have limited driving capacity, and to drive more than one cell, a buffer-like cell known as a Josephson Transmission Line (JTL) must be placed between two cells. This limits the direct usage of standard CMOS tools to perform a design space exploration for superconducting technology. To overcome this problem, we use open-source back-end design tools to incorporate design constraints specific to superconducting logic. In addition to the modified design tool, we use analytical models to calculate performance, power, and area for superconducting logic. We introduce a workflow and methodology to explore the design space for accelerators built in superconducting logic. Such a workflow can help other researchers in exploring different designs at the architecture level using this emerging technology.

2 SUPERCONDUCTING TECHNOLOGY
2.1 Josephson Junction Device
A few metals exhibit zero resistance to the flow of current at cryogenic temperatures, a phenomenon known as superconductivity. Superconductivity can be achieved by cooling metal wires below their critical temperature. For example, Aluminum (Al) and Niobium (Nb) superconduct below 1.2K and 9.3K respectively. Superconductivity is leveraged in building a switching device called a Josephson Junction (JJ).

Figure 2: (a) Josephson junction device (JJ) (b) shunted JJ (c) circuit symbol for JJ (d) Voltage-time curve (time in pico-seconds).
A JJ is fabricated by interposing a thin barrier between two superconducting wires as shown in Figure 2(a). This barrier allows the electrons to tunnel through it even in the absence of an applied voltage. Moreover, the tunneling is controlled by changing the input current. For example, when the current flowing through the device exceeds its critical current ($I_c$), a JJ switches from the superconducting to a resistive state. Alternately, it goes back to the superconducting state if the current is reduced below $I_c$. Note that the voltage across the JJ in the superconducting state becomes zero.

In a superconducting loop with a JJ, the magnetic flux ($\Phi$) is quantized, i.e., it can take only integer multiples of a single flux quantum (SFQ) ($\Phi_0$). Magnetic flux ($\Phi$) is the magnetic field integrated over an area. SFQ is the magnetic flux generated by the tunneling of an electron pair. By switching the JJ between superconducting and resistive states, the amount of flux in the circuit can be controlled. The presence or absence of an SFQ can thus be used to represent digital information “1” and “0” respectively. When a JJ switches from the superconducting to a resistive state, the magnetic flux through the superconducting loop containing the JJ changes by a flux quantum, generating an SFQ pulse of about 1 pico-second duration and 2 milli-volt magnitude, as shown in Figure 2(d).

In superconducting technology, SFQ pulses facilitate encoding, processing, and transmission of digital information. JJs are almost ideal digital switches that are characterized by two basic properties: high-speed switching and ultra-low power dissipation. SFQ pulses can be as narrow as one pico-second, making it possible to clock circuits at very high frequencies. Superconducting passive transmission lines (PTL) are also able to transmit SFQ pulses with extremely low losses at 4K. These lossless interconnects and the low switching energy of JJs (2x − J) enable very low power dissipation.
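As a quick consistency check on the pulse parameters quoted above, the sketch below computes the magnetic flux quantum from the standard relation Φ0 = h / (2e) and compares it with the voltage-time area of a 2 mV, 1 ps pulse. The rectangular pulse shape and the variable names are illustrative assumptions and are not taken from the paper.

```python
# Flux quantum from SI-defined constants (exact values since the 2019 SI redefinition).
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C

# An electron pair carries charge 2e, so one flux quantum is h / (2e).
flux_quantum = PLANCK_H / (2 * ELEMENTARY_CHARGE)  # in webers (V*s)

# Assumption: approximate the SFQ pulse as a rectangle of 2 mV height and 1 ps width.
pulse_voltage = 2e-3    # V
pulse_duration = 1e-12  # s
pulse_area = pulse_voltage * pulse_duration  # V*s

print(f"flux quantum        : {flux_quantum:.4e} Wb")
print(f"2 mV x 1 ps pulse   : {pulse_area:.4e} V*s")
print(f"ratio (pulse / Phi0): {pulse_area / flux_quantum:.2f}")
```

The time integral of the voltage across a switching JJ equals one flux quantum, which is about 2.07 mV·ps, so the "1 pico-second, 2 milli-volt" figures in the text describe a pulse of roughly that area.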
Reciprocal Quantum Logic (RQL) uses JJ based switches such that a digital “1” is encoded as a pair of SFQ pulses of opposite polarity and a “0” is encoded as the absence of SFQ pulses, as shown in Figure 3(a). For details of RQL logic gate design, please refer to [3]. The RQL family consists of two universal gates, the AND-OR gate and the logical A-AND-NOT-B gate (referred to as A-NOT-B, as shown in Figure 3), that enable the design of complex circuits [11].
Figure 3: (a) Data encoding in RQL and A-AND-NOT-B logical operation (b) Logic table (A, B → C): 0, 0 → 0; 0, 1 → 0; 1, 0 → 1 (c) Circuit schematic
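The two RQL gates described above can be modelled at the Boolean level to show why they suffice for building other functions. This is a minimal functional sketch only: it ignores SFQ pulse encoding, clock phases, and JTLs, and it assumes the AND-OR gate exposes both the AND and the OR of its two inputs. The helper names (`and_or`, `a_not_b`, `rql_not`, `rql_xor`) are hypothetical.

```python
def and_or(a: int, b: int) -> tuple[int, int]:
    """Assumed model of the RQL AND-OR gate: returns (a AND b, a OR b)."""
    return a & b, a | b

def a_not_b(a: int, b: int) -> int:
    """A-AND-NOT-B gate: output is 1 only when a is 1 and b is 0."""
    return a & (1 - b)

def rql_not(b: int) -> int:
    """Inverter built by tying input A of the A-NOT-B gate to constant 1."""
    return a_not_b(1, b)

def rql_xor(a: int, b: int) -> int:
    """XOR composed as (a AND NOT b) OR (b AND NOT a)."""
    _, either = and_or(a_not_b(a, b), a_not_b(b, a))
    return either

# Truth table of the A-NOT-B gate, in the (A, B, C) order used by Figure 3(b).
truth_table = [(a, b, a_not_b(a, b)) for a in (0, 1) for b in (0, 1)]
for row in truth_table:
    print(row)
```

Because inversion, AND, and OR are all available from these two cells, any combinational function can be composed from them, at the cost of extra gates and the JTLs needed wherever an output feeds more than one input.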
Since its introduction in 2011, RQL circuits with 72,800 JJs have been demonstrated. Other demonstrated circuits include shift registers, small arithmetic circuits, transmission driver systems, and serial data receiver systems [12–15]. Design and resource estimates exist for 32-bit and 64-bit integer and floating-point arithmetic and logical units, register files, on-chip storage components, and Bloom filters [4, 16, 17].
Table 1 compares the energy-efficiency of typical operations in CMOS and superconducting technology. To report energy for 16-nm CMOS, we use established ITRS scaling factors since we lack open source synthesis libraries [18, 19]. We observe that memory operations are less energy efficient compared to arithmetic and logic operations for JJ technology. Furthermore, building memory takes more area in JJ technology, as there is no dense memory solution like SRAM or DRAM currently available in the superconducting domain. Researchers are exploring superconducting memory solutions such as hybrid JJ-CMOS memory and Josephson magnetic random access memory (JMRAM), but their capacity is likely to remain severely limited compared to conventional technologies. The limited device density, costly memory operations, and limited memory capacity constrain the potential applications to computationally intensive applications that have small working sets. We explore the design of superconducting technology for one such application.
Table 1: Energy/op comparison of 16-nm CMOS and JJ-logic (including cooling overhead of 300x)

Parameter               CMOS-16nm   JJ at 4K      Improvement
64bit-Add               0.592 pJ    0.06 pJ       9.86x
64bit-Multiply          2.367 pJ    0.248 pJ      9.54x
64bit-RF-Load           0.050 pJ    0.05 pJ       1x
Off-chip Interconnect   300 pJ      8.6 fJ        30000x
300K to 4K link         -           3712 pJ/bit   NA

3 SUPERCONDUCTING ACCELERATOR
Superconducting circuits offer high energy efficiency. However, with limited device density and memory capacity, designing superconducting general purpose computers is incredibly challenging. Furthermore, the lack of sophisticated design tools exacerbates the density problem, as existing CMOS-based synthesis tools used for JJ design cannot maximize device utilization. We believe that both of the problems are related to design and manufacturing economics rather than being fundamental challenges. However, until the technology reaches the maturity to manufacture and test a billion Josephson Junctions per cm², which is typically required for general-purpose computing, we can leverage the technology to build accelerators. Applications with tiny working set size and high computational intensity are ideally suited for JJ-based accelerators. We study the SHA-256 application for building accelerators using JJ technology. We provide an overview of the application, the baseline CMOS implementation, and our JJ-based implementation. We optimize the JJ design for performance (in Section 4) and reliability (in Section 5). We use the methodology and workflow described in Section 6 for evaluations using the JJ technology.

A blockchain is a decentralized public ledger of transactions that maintains the validity of transactions by a distributed consensus mechanism [20]. In bitcoin, the process of authenticating transactions in this public ledger is called mining. It involves searching for a 32-bit key, known as the nonce value, such that when combined with the message which lists the transactions, the double SHA-256 hash of the block (message + key) falls within a certain range. Whenever a miner finds a block, i.e., a 32-bit key (nonce) for an input message that leads to the desired hash, the miner is rewarded with bitcoins. The overall process for bitcoin mining is captured in Figure 4(a).

Figure 4: (a) An overview of Bitcoin Mining (b) Overview of the SHA-256 Algorithm

A bitcoin miner tries to maximize profit by trying multiple keys in parallel and as fast as possible, as the probability of finding the key and getting rewarded is directly proportional to the total hash rate. However, repeated SHA-256 computation requires substantial power due to its high computational intensity. The net profit depends both on the reward and on operating costs (energy consumption [21]). Hence, energy-efficiency (in GH/J) is the figure-of-merit that is optimized to increase profits. For this reason, bitcoin mining has evolved from CPUs to GPUs to FPGAs and finally to ASIC-based implementations in the last decade [22, 23].
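The nonce search described above can be sketched in a few lines of software. This is a simplified illustration, not the Bitcoin protocol: the message layout, the big-endian interpretation of the hash, and the names `double_sha256` and `mine` are assumptions made for the example, and the toy target is far easier than any real network threshold.

```python
import hashlib
import struct

def double_sha256(data: bytes) -> bytes:
    """Apply SHA-256 twice, as required for the block hash."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(message: bytes, target: int, max_nonce: int = 2**32):
    """Try 32-bit nonces in order until the hash value H satisfies H < target.

    Returns (nonce, hash_value), or None if no nonce in range qualifies.
    """
    for nonce in range(max_nonce):
        block = message + struct.pack("<I", nonce)  # message + 32-bit key
        hash_value = int.from_bytes(double_sha256(block), "big")
        if hash_value < target:
            return nonce, hash_value
    return None

# Toy threshold: the top 8 bits of the 256-bit hash must be zero,
# so roughly one in 256 nonces succeeds.
toy_target = 1 << 248
result = mine(b"example list of transactions", toy_target)
print(result)
```

Every iteration reuses the same message and changes only the 4-byte nonce, which is why the workload has an exceptionally high compute intensity and a tiny memory footprint.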
The SHA-256 computation of a message is carried out as shown in Figure 4(b). The message scheduler unit (MSU) takes an incoming message and splits it into 512-bit chunks. The MSU schedules a different 32-bit data word to the compression function generator (CFG) every cycle, consuming 512 bits of data over 64 rounds. The CFG uses this data and predefined constants to generate a 256-bit intermediate hash after every 64 iterations, which is collected by the intermediate hash collector (IHC). When the entire message is processed, the values in the IHC registers are the final 256-bit hash.
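The division of labour between the MSU, CFG, and IHC can be illustrated with a plain software model of SHA-256. This is a sketch for exposition, not the hardware design: `message_schedule` plays the role of the MSU, `compress` the role of the CFG, and the running `state` the role of the IHC. The function names are hypothetical, and the round constants are derived from prime cube and square roots as in the SHA-256 standard rather than listed explicitly.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32

def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def icbrt(x):
    """Floor of the integer cube root, by binary search."""
    lo, hi = 0, 1 << ((x.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Predefined constants: fractional bits of cube/square roots of the first primes.
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
H0 = [isqrt(p << 64) & MASK for p in first_primes(8)]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def message_schedule(chunk):
    """MSU role: expand one 512-bit chunk into 64 words, one per round."""
    w = [int.from_bytes(chunk[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)  # 4-input addition
    return w

def compress(state, chunk):
    """CFG role: 64 rounds over eight 32-bit working variables."""
    a, b, c, d, e, f, g, h = state
    for k_t, w_t in zip(K, message_schedule(chunk)):
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + k_t + w_t) & MASK
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]

def sha256(message: bytes) -> bytes:
    """IHC role: carry the intermediate hash across all 512-bit chunks."""
    padded = (message + b"\x80" + b"\x00" * ((55 - len(message)) % 64)
              + (8 * len(message)).to_bytes(8, "big"))
    state = list(H0)
    for offset in range(0, len(padded), 64):
        state = compress(state, padded[offset:offset + 64])
    return b"".join(word.to_bytes(4, "big") for word in state)

print(sha256(b"abc").hex() == hashlib.sha256(b"abc").hexdigest())
```

In the model, all rotations and shifts use fixed amounts and the constants never change, which matches the observation that a fully unrolled hardware pipeline can hard-wire both. The chains of 32-bit additions in `message_schedule` and `compress` are the operations that dominate the critical path.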
Bitcoin mining ASICs are available commercially from different vendors today. Furthermore, the state-of-the-art ASICs are fully custom designed at 16 nm or lower technology nodes and implement several design and algorithmic optimizations to increase the throughput (GH/s) and energy efficiency (GH/J). However, bitcoin mining is a competitive industry and the designs of state-of-the-art industrial accelerators are often kept proprietary. In order to make a technology comparison for the same accelerator design, we use the publicly available Goldstrike 1 [9] miner as the baseline for our studies (we compare our proposal with the publicly reported energy-efficiency for the 16 nm AntMiner S9 in Section 5.6).

A hash engine contains two instances of the SHA-256 computation block. The SHA-256 algorithm uses 64 iterations, which can be pipelined. In Goldstrike 1, these iterations are fully unrolled for both rounds, which leads to a 128-stage pipeline. Each pipeline stage comprises compression function generation (CFG) logic and a message scheduling unit (MSU). The hash collector compares the output hash with the target to be achieved and, if the criterion is met, it sends the result to the host.
We propose a superconducting blockchain accelerator that operates at 4K temperature and communicates with a host at room temperature. The architecture of our hash engine is shown in Figure 5(a). The host receives the incoming messages from the network and offloads them to the accelerator. The accelerator computes hashes for different nonce values and sends a message to the host when the network target is met. We port the CMOS Goldstrike 1 design to superconducting logic without any optimization.

The SHA-256 algorithm requires computation using predefined constants. In our fully pipelined design, each pipeline stage requires a different fixed 32-bit constant for its computations, and these constants are tied off in the superconducting design to save on resources. The rotations and shifts in the SHA-256 computation involve fixed rotate/shift amounts, so the design does not implement any actual rotator or shifter logic but requires the signals to be routed appropriately.
Figure 5(a) shows an overview of our JJ-based implementation of GoldStrike 1, which is designed by simply porting the CMOS-based implementation to a JJ-based implementation. Based on our methodology described in Section 6.1, we compute the area (measured in JJ-complexity) for this design. The baseline design incurs significant JTL overheads (buffers that are required to facilitate fanout).

Figure 5: Design of a JJ-based Accelerator for SHA-256 based on GoldStrike 1 (a) Host at 300K communicates with accelerator at 4K (b) Technology-aware design (c) Fault tolerant design
For the analysis of JTL overheads, we study the distribution of fanout in our design. Figure 6 shows the distribution of fanout in a single stage of our hash engine. A gate drives on average about 1.5 gates, requiring 50% additional JJs for fanout and incurring significant area overheads.

We perform a design space exploration to best meet the requirements of superconducting technology and present our results for a technology-aware design of the superconducting SHA accelerator in Section 4 (as shown in Figure 5(b)). Reliability is a key challenge in superconducting logic circuits, and we present a case for a reliable, fault-tolerant SHA accelerator in Section 5 (as shown in Figure 5(c)).
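The fanout accounting behind the JTL overhead can be sketched as follows. The histogram is hypothetical (chosen only so that the mean fanout comes out at 1.5, the average reported above), and the cost model of one JTL per additional driven output is a simplifying assumption; only the figure of two JJs per JTL comes from the text.

```python
JJS_PER_JTL = 2  # from the text: each JTL costs two JJ devices

def fanout_overhead(fanout_histogram):
    """Return (mean fanout, extra driven outputs, JJs spent on JTLs).

    fanout_histogram maps a fanout value to the number of gates with that fanout.
    Assumption: a gate drives one output for free and needs one JTL for
    each additional output.
    """
    gates = sum(fanout_histogram.values())
    driven = sum(fanout * count for fanout, count in fanout_histogram.items())
    extra = sum(max(fanout - 1, 0) * count
                for fanout, count in fanout_histogram.items())
    return driven / gates, extra, extra * JJS_PER_JTL

# Hypothetical distribution for 1000 gates: most gates drive a single output.
example_histogram = {1: 700, 2: 200, 3: 50, 4: 25, 6: 25}
mean_fanout, extra_outputs, jtl_jjs = fanout_overhead(example_histogram)
print(mean_fanout, extra_outputs, jtl_jjs)
```

Under this model the overhead grows with the excess of fanout above one, so a small number of high-fanout nets can contribute a large share of the JTL cost even when most gates drive a single output.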
In our design, 128 different values of nonce are processed in the pipeline and a hash is generated every cycle once the pipeline is full. The critical path in our design comprises four adders in the CFG. We report the hashrate, power, and energy-efficiency in Table 2 for the accelerator, using the methodology described in Section 6, for two design points: one with ripple carry adders (RCAs) and one with Kogge-Stone adders (KSAs). An RCA is 3x more energy-efficient than a KSA, but a KSA has 30% lower latency. This enables us to compare two design points, one that is optimized for energy-efficiency and another that is optimized for performance. For the high performance design, KSAs are used economically since they are expensive in terms of resources. They are used only to optimize the speed-path, and the non-critical path adders are still designed to be RCAs. Table 2 also compares the performance and energy-efficiency of the GoldStrike 1 accelerator designed with superconducting logic using the baseline CMOS-based architecture for the two different design points.
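The structural difference between the two adder styles can be seen in a small functional model. This is a software sketch, not an RQL netlist: the returned step counts are logic levels of carry computation, not JJ delays or energy figures, and the function names are hypothetical.

```python
def ripple_carry_add(a: int, b: int, width: int = 32):
    """Ripple carry adder: the carry moves through one full adder per bit.

    Returns (sum modulo 2**width, number of sequential full-adder steps).
    """
    carry, total = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, width

def kogge_stone_add(a: int, b: int, width: int = 32):
    """Kogge-Stone adder: parallel-prefix carry computation.

    Returns (sum modulo 2**width, number of prefix levels).
    """
    mask = (1 << width) - 1
    generate, propagate = a & b, a ^ b
    half_sum = propagate
    levels, distance = 0, 1
    while distance < width:
        # Each level merges a bit's (generate, propagate) pair with the pair
        # located `distance` positions below it, at every bit in parallel.
        generate = (generate | (propagate & (generate << distance))) & mask
        propagate = (propagate & (propagate << distance)) & mask
        distance <<= 1
        levels += 1
    carries = (generate << 1) & mask  # carry into bit i is the group generate of bits below
    return (half_sum ^ carries) & mask, levels

print(ripple_carry_add(0xFFFFFFFF, 1))
print(kogge_stone_add(0x12345678, 0x0FEDCBA8))
```

For 32-bit operands the ripple carry adder needs 32 sequential steps while the Kogge-Stone adder needs 5 prefix levels, but every prefix level reads each signal at two places and over increasingly long distances. In a technology where each extra fanout costs a JTL, that wiring pattern is what makes the faster adder expensive.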
Figure 6: Distribution of fanout in a single pipeline stage of the baseline SHA accelerator (x-axis: fanout; y-axis: frequency of fanout, log scale)

Table 2: Performance and Energy Evaluations for SHA accelerator implemented in CMOS and JJ

Parameter                  GoldStrike 1    JJ-Design       JJ-Design
                           CMOS            only RCA        with KSA
Technology                 16 nm           248 nm          248 nm
JJ Complexity (million)    N/A             3.38            5.54
Hashrate (GH/s)            1.05            0.661           0.951
Total Power (milli-Watt)   250             15.65           36.23
Energy-Efficiency (GH/J)   4.0 (1x)        42.26 (10.6x)   26.24 (6.56x)
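The energy-efficiency row of Table 2 follows from dividing hashrate by power. The sketch below recomputes it from the tabulated hashrate and power values; the dictionary layout and the function name are illustrative choices, and the inputs are the rounded numbers from the table, so small differences from the printed efficiencies are expected.

```python
# (hashrate in GH/s, total power in mW), copied from Table 2.
designs = {
    "GoldStrike 1 (CMOS)": (1.05, 250.0),
    "JJ-Design, only RCA": (0.661, 15.65),
    "JJ-Design, with KSA": (0.951, 36.23),
}

def efficiency_gh_per_joule(hashrate_ghs: float, power_mw: float) -> float:
    """GH/s divided by W gives GH/J."""
    return hashrate_ghs / (power_mw / 1000.0)

efficiency = {name: efficiency_gh_per_joule(*values)
              for name, values in designs.items()}
for name, value in efficiency.items():
    print(f"{name:22s} {value:6.2f} GH/J")

rca_vs_ksa = efficiency["JJ-Design, only RCA"] / efficiency["JJ-Design, with KSA"]
print(f"RCA-only design is {rca_vs_ksa:.2f}x more efficient than the KSA design")
```

The two JJ rows reproduce the tabulated 42.26 and 26.24 GH/J to within rounding. The CMOS row evaluates to 4.2 GH/J from the rounded inputs, where the table lists 4.0, and the improvement factors in the table are stated relative to 4.0.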
The JJ-based design that is implemented with only Ripple-Carry Adders is 10x more energy efficient than the CMOS implementation; however, it has 37% lower performance. Using Kogge-Stone adders reduces the energy-efficiency to 6.5x while bridging the performance difference to within 10%. We observe that the energy-efficiency of our design reduces by almost one-third for the design optimized with KSAs, indicating that optimizing only for high speed can be detrimental to the overall energy-efficiency. However, both designs show that simply porting the accelerator from CMOS to superconducting logic can provide significant energy-efficiency improvement.

The contribution towards JJ-complexity for our hash engine comes from adders, registers, and other logic. Table 3 shows the contribution towards JJ-complexity from each of these three sources. We observe that about 50% of the contribution towards JJ-complexity is from adders for an RCA-based design, and this increases to 67.7% for KSAs. Optimizing the accelerator design to suit the specific constraints of the JJ technology can further improve energy efficiency. We discuss technology-aware optimizations in Section 4.
Table 3: Breakdown of JJ-complexity

Design      Adders          Registers       Other Logic    Total
            (million)       (million)       (million)      (million)
With RCAs   1.69 (50.1%)    1.51 (44.8%)    0.17 (5.1%)    3.38
With KSAs   3.75 (67.7%)    1.51 (27.3%)    0.28 (5.0%)    5.54

4 TECHNOLOGY-AWARE DESIGN
In this section, we discuss the impact of JJ technology on design and architectural decisions. To illustrate the contrast between CMOS and JJ designs, we focus on two critical components of the SHA engine: adders and registers. We also discuss a way to optimize communication for the accelerator.
The proposed SHA engine uses 1200 adders, which account for more than 50% of JJ-complexity. Furthermore, the clock frequency of the SHA engine is dictated by the critical path, which consists of four additions in the CFG unit as shown in Figure 5(b). Adders dominate the on-chip resources and overall latency. Thus, optimizing the adders to improve the critical path and overall energy efficiency is essential.

Typically, CMOS adder designs improve latency at the expense of more transistors or complex connectivity. Although the complexity of the adder design increases, the delay and energy efficiency also improve. For example, a complex Kogge Stone Adder (KSA) is faster and more energy efficient as compared to a simple Ripple Carry Adder (RCA). However, JJ based adders do not follow the same trends. For instance, tree based adders rely on complex communication patterns to improve the critical path from $O(N)$ to $O(\log N)$. To enable tree based adders, we need greater fan-out and complex wiring, both of which have low overhead in CMOS. In contrast, the limited fanout of RQL gates requires JTLs, leading to high JJ complexity. For example, a JJ based KSA improves performance, but it worsens the energy-efficiency [4, 24]. While designing the SHA engine, a combination of adders can be selected such that our design meets the baseline CMOS performance and maximizes overall energy-efficiency. To satisfy these criteria, we choose different design combinations of KSA and RCA as shown in Figure 7.

Figure 7: Latency, energy, and energy-delay product for different critical path adder designs normalized to Ripple Carry Adder (RCA) parameters

Table 2 shows that merely replacing RCAs with KSAs improves the critical path but degrades the energy-efficiency. Furthermore, even after replacing all four critical path RCAs with KSAs, the JJ based accelerator fails to meet the baseline delay. Our goal is to meet the critical path requirement without deteriorating the energy efficiency. Thus we try to optimize our design such that JTL overheads are reduced and the simplicity of the RCA is maintained. We observe that the majority of the additions in the CFG of Figure 5(b) are back to back additions and most intermediate addition results are not used elsewhere. Thus, it is possible to replace some of these adders by a sequence of carry save adders (CSA). An n-operand CSA computes the composite addition much faster as compared to ripple carry adders. If $\delta_{FA}$ is the delay of a 1-bit full adder (FA), the latency of adding $N$ $k$-bit numbers with a CSA is given by $\text{Latency}_{CSA} = (k + N - 2)\,\delta_{FA}$.

A CSA has lower fanout and does not require routing between distant gates, making it more layout friendly. When two back to back adders on the critical path of the CFG and the MSU are replaced by a 3-op CSA, the design has 1.2x the performance and is 1.25x more energy-efficient than our baseline design, even after accounting for 20% skew. Hardware optimizations have been proposed in the past to move the addition of variables $W_i$ and $K_i$ from the MSU to the CFG [25]. We propose a similar optimization where this value is pre-computed in the (i−1)-th stage of the pipeline and consumed in the i-th stage. This allows us to use a 4-op CSA in both the CFG and MSU blocks and fuse 3 additions, thereby reducing the overall critical path. This design offers 1.67x the performance of the RCA baseline design and is 1.44x more energy-efficient. Table 4 lists the performance and energy-efficiency of the superconducting hash engine for the different adder optimized designs against the baseline design using all RCAs. A similar optimization uses multiple such CSAs in parallel for a high-speed SHA-256 ASIC design besides carry-lookahead adders [26]. However, we use these CSAs in conjunction with energy-efficient RCAs to have a more economical design in terms of JJ-complexity.

In the baseline pipeline design, in each stage the MSU uses 16 registers with 32-bit width, and the CFG uses 8 registers. This results in about 35% JJ complexity for an optimized adder circuit. The registers hold the input values and the intermediate results. The contents of the registers are consumed by adders and other logic to produce an output in every stage, and subsequent stages consume the produced output of the stage. A baseline design replicates all 24 registers at every stage, requiring large JJ complexity. Furthermore, all the register values from the i-th stage to the (i+1)-th stage are expected to flush every clock cycle. We would need a wide bus to flush the contents of the registers every cycle. A wide bus and a small set of registers are trivial in CMOS. However, in JJ technology, the JTL cost of wide buses and registers leads to high costs.

In a traditional non-pipelined SHA-256 design, a global register holds all the intermediate values, whereas in the quasi-pipelined SHA designs (SHA engine with a 4 stage pipeline) each pipeline stage uses a local register [27]. The local register file enables a higher clock rate.
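The register replication cost of the baseline pipeline can be tallied directly from the numbers given in the text. The sketch below is back-of-the-envelope arithmetic only: the register counts, width, and the 128-stage depth come from the paper, while the JJs-per-bit figure is a rough derived estimate that assumes the 1.51 million register JJs in Table 3 cover exactly these per-stage registers and their wiring.

```python
REGISTER_WIDTH_BITS = 32
MSU_REGISTERS_PER_STAGE = 16
CFG_REGISTERS_PER_STAGE = 8
PIPELINE_STAGES = 128           # two fully unrolled 64-round SHA-256 blocks
REGISTER_JJS_MILLION = 1.51     # register contribution reported in Table 3

registers_per_stage = MSU_REGISTERS_PER_STAGE + CFG_REGISTERS_PER_STAGE
# Every register is forwarded each cycle, so this is also the inter-stage bus width.
bits_per_stage = registers_per_stage * REGISTER_WIDTH_BITS
total_register_bits = bits_per_stage * PIPELINE_STAGES

# Rough estimate under the assumption stated above.
jjs_per_register_bit = REGISTER_JJS_MILLION * 1e6 / total_register_bits

print(f"registers per stage     : {registers_per_stage}")
print(f"bus width between stages: {bits_per_stage} bits")
print(f"register bits in total  : {total_register_bits}")
print(f"approx. JJs per bit     : {jjs_per_register_bit:.1f}")
```

A 768-bit transfer between every pair of adjacent stages is what makes the wide bus costly once each extra connection needs JTLs, which motivates reducing the number of values that are merely copied from stage to stage.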
Whereas, shared registers improve the critical pathsignificantly, especially for heavily pipelined designs requiring datavalues every cycle. So, to supply register values each uses a localset of registers leading to high JJ complexity (registers account for35% JJs).The baseline design has fixed control path, and identical opera-tions are performed in every stage of the pipeline. Each pipeline a) (b) Working Registers
Datapath N th Stage
Datapath
Working
Registers
Storage
RegistersStorage Registers
Storage
Registers (N+1) th Stage (N+2) th Stage Working Registers
Working Registers
Datapath N th Stage
Datapath
Working
Registers
Storage
RegistersStorage Registers
Storage
Registers (N+1) th Stage (N+2) th Stage Working Registers
Working Registers
Datapath N th Stage Datapath
Working
Registers (N+2) th Stage
Working Registers
Delay
Line (N+1) th Stage
Figure 8: (a) Basic design needs 16 registers (storage + workingregisters) in MSU per stage (b) Optimized design with delay-line reduces it to 4 working registers stage produces an output that is consumed in the next set of stages.For example in MSU, only four registers are consumed by 4 input32-bit adder in each stage of the pipeline, to produce one 32-bit out-put. After that, all the registers are simply copied to the next stage,such that N th register of the current stage is copied to ( N + ) th register of the subsequent stage as shown in the Figure 8(a). Thusonly one register value is produced, four values are consumed andrest of the register values are copied as is to the next stage. We canleverage this deterministic production and consumption of the valuesto eliminate the large fraction of registers.An alternative to communicating between stages using registersis to connect producer and consumer via a Delay-Line . Delay linememory is a form of memory used in earliest computers during1960s [28, 29]. Unlike modern day random access memories, adelay line memory is based on sequential access and requires to berefreshed from time to time. Such memories rely on transmittinginformation through a circuitry that adds delay and re-routing theend of the delay path to the input end such that information canbe transmitted continuously through the closed loop as shown inFigure 9(a). We propose to delay lines to route data from producerregister to consumer register in a synchronized manner by using theprecise number of delay elements to match the desired delay.In RQL, a delay line can be built using a series of JTLs that repeatsignals for every clock activation. On a JTL delay line, input data ispropagated from one JJ to next JJ every clock phase. This providesan efficient way to leverage the producer and consumer patterns in ahash engine to reduce JJ complexity. Delay lines keep the data inflight and deliver to the consumer at the precise clock cycle. Since
Delay Element Delay Element (a)
Delay Element (a)
JTL JTL JTL
Delay Element ConsumerRegister
JTL JTL JTL
Delay Element ConsumerRegister(b)
JTL JTL JTL
Delay Element ConsumerRegister(b)Producer RegisterProducer Register
Figure 9: (a) Delay Line Memory (b) JTL based delay line candelay and forward data from one stage to other Table 4: Performance and Energy after Optimization
Parameter RCA KSA 4-CSA 4-CSA + RegAdder Adder Adder optimizationJJ Complexity (million) 3.38 5.45 3.57 2.89Hashrate (GHz) 0.661 0.951 1.101 1.101Total Power (mW) 15.64 36.22 27.5 22.26Energy Efficiency (GH/J) 42.27 26.24 40.05 49.47
Efficiency wrt CMOS-16nm 10.56x 6.39x 10.0x 12.37xHashrate wrt CMOS-16nm 0.63x 0.90x 1.05x 1.05x the delay line can simply load a new value every clock cycle, it canbe integrated seamlessly with the proposed pipeline design.A delay line can facilitate the delivery of intermediate resultsfrom a producer stage to a consumer stage. The cost of delay linememory is 4 JJs per clock cycle per bit whereas register storagerequires 12 JJs per bit. Although the crossover-point for the flopbased register file is 3 clock cycles, the delay line memory enablespoint to point connection between the producer and consumer thateliminates the need for 16 registers for every stage. We use fourstaging registers along with the delay line design to tolerate clockskew. The delay lines reduce the per stage JJ cost by almost 20%.Furthermore, it simplifies the bus design.
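The crossover point quoted above follows directly from the two per-bit costs. The sketch below is only a small arithmetic check using the figures from the text (12 JJs per register bit, 4 JJs per bit per clock cycle of delay); the function and constant names are illustrative and do not come from the paper.

```python
# Per-bit JJ costs quoted in the text.
JJ_PER_REGISTER_BIT = 12          # flop-based register storage
JJ_PER_DELAY_BIT_PER_CYCLE = 4    # JTL delay line, per clock cycle of delay


def register_cost_jj(bits: int) -> int:
    """JJs needed to hold `bits` bits in a register (independent of hold time)."""
    return JJ_PER_REGISTER_BIT * bits


def delay_line_cost_jj(bits: int, cycles: int) -> int:
    """JJs needed to keep `bits` bits in flight for `cycles` clock cycles."""
    return JJ_PER_DELAY_BIT_PER_CYCLE * bits * cycles


def crossover_cycles() -> float:
    """Delay (in cycles) at which a delay line costs as much as a register."""
    return JJ_PER_REGISTER_BIT / JJ_PER_DELAY_BIT_PER_CYCLE


if __name__ == "__main__":
    for cycles in (1, 2, 3, 4):
        print(cycles, delay_line_cost_jj(32, cycles), register_cost_jj(32))
    print("crossover:", crossover_cycles(), "cycles")
```

For a 32-bit value, the delay line is cheaper than a register for delays shorter than three cycles and more expensive beyond that, which is why the text attributes the savings mainly to the point-to-point producer-consumer connection rather than to the per-bit cost alone.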
Table 4 shows the performance and energy-efficiency of our baseline JJ-based accelerator (with RCA/KSA), of the design with the technology-aware 4-operand (four-input) CSA optimization, and of the design that additionally uses delay lines to reduce register cost. The 4-operand CSA optimization improves the energy-efficiency from 6.39x for KSA to 10.0x, while also improving the performance by 15% (bringing it in line with the performance of the CMOS-based implementation). The delay-line optimization reduces register file costs and improves energy efficiency from 10.0x to 12.4x, while still having similar performance.
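As a rough consistency check, the energy-efficiency row of Table 4 can be recomputed from the hashrate and total power rows. The sketch below assumes that the Hashrate row is in GH/s and that efficiency is simply hashrate divided by total power; the dictionary and function names are illustrative.

```python
# (hashrate in GH/s, total power in mW, reported efficiency in GH/J) from Table 4.
TABLE4 = {
    "RCA Adder": (0.661, 15.64, 42.27),
    "KSA Adder": (0.951, 36.22, 26.24),
    "4-CSA Adder": (1.101, 27.5, 40.05),
    "4-CSA + Reg optimization": (1.101, 22.26, 49.47),
}


def efficiency_gh_per_j(hashrate_ghs: float, power_mw: float) -> float:
    """Energy efficiency in GH/J: hashes per second divided by watts."""
    return hashrate_ghs / (power_mw * 1e-3)


if __name__ == "__main__":
    for design, (hashrate, power, reported) in TABLE4.items():
        computed = efficiency_gh_per_j(hashrate, power)
        print(f"{design}: computed {computed:.2f} GH/J, reported {reported} GH/J")
```

Under these assumptions the recomputed values agree with the reported row to within about 0.1%, which suggests the small differences are rounding in the table.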
In this section, we discuss fault models for JJ technology and present a design that uses architecture-level solutions to protect against these faults. We also discuss how the proposed fault-tolerant design can be leveraged to improve energy efficiency by operating the circuit at a Better-Than-Worst-Case (BTWC) design point.
There are three primary sources of faults in superconducting logic: fabrication defects, device-level variations, and a non-ideal operating environment. Fabrication defects result from the material and masking defects introduced during fabrication. These defects can manifest as permanent stuck-at faults, similar to birth-time defects in CMOS, and can be mitigated by design-time testing. Device parameter variation can degrade noise margins. For example, variation in critical current (I_c) can degrade the noise margin and result in timing errors, so the design must operate at a point where it is robust against such variations.

In JJ technology, flux trapping causes a unique source of faults, which we term the operating environment fault. These faults are challenging to protect against due to the correlated nature of the errors.
Figure 10: (a) Baseline design with no redundancy (b) Reliable design to mitigate correlated faults using sparing stage and bypass logic (c) Logic for bypass is vulnerable to faults (d) Mux design for the selection logic for sparing technique (e) Reliable design using enhanced bypass logic with redundant mux that can tolerate at most two faults.
Furthermore, these faults are neither permanent nor transient, and they manifest not only because of the device but also due to non-ideal operating conditions. Flux trapping results from the trapping of stray magnetic field in the JJ circuits due to non-uniform cooling and can result in non-functional circuits or reduced noise margin for parts of the chip. Fortunately, steady progress and innovation in fabrication and device technology have limited the problem of flux trapping considerably [13]. The reported flux-trapping solutions are demonstrated on 50K JJ circuits. However, the techniques are costly, hard to scale to large systems, and do not completely eliminate the problem of flux trapping. For example, some of the demonstrations use active magnetic field cancellation or extremely low temperatures (<1 K) at which flux vortices freeze; both of these additional requirements are expensive, especially for large-scale systems.
To understand the impact of faults on the output of the SHA engine, we use fault injection to quantify the Architectural Vulnerability Factor (AVF). For the baseline design, injection of faults shows 98.89% AVF. The high AVF of the SHA engine results from the entropy-maximization property of the algorithm, where a single-bit operational error can corrupt the output. Protecting a SHA engine is traditionally a non-trivial problem due to its cryptographic properties and tight area and energy constraints. Techniques based on replication or parity detection circuits are either too complex and expensive or provide only partial protection against faults.
Transient faults do not have any meaningful impact on the mining process and hash rate, as a transient error can corrupt only one of the key combinations. The probability of a miner missing out on a reward due to a transient fault is extremely small. For instance, the probability of finding a block is relatively low ( ), and if the probability of a transient fault is P, then the probability that the two events coincide is significantly lower ( P ). Recent proposals take advantage of this property to enable approximate bitcoin mining [30].

On the other hand, permanent faults would result in a non-functional SHA engine, thus reducing the yield significantly. Furthermore, if not detected before deployment, the miner would simply consume power without doing any work. This problem is significantly worse for flux-trapping faults, as fault patterns change on every warm-up cycle, which forces us to test SHA engines after every cool-down. In CMOS, non-functional chips can be isolated by post-fabrication tests, whereas in JJ circuits faults can arise not only from fabrication defects but also from operating conditions.

The correlated nature of the faults, caused by the large-granularity impact of flux traps, limits the ability to use the standard low-cost techniques that protect against single-bit faults in conventional technologies. Our goal is to leverage the regular structure of the accelerator to improve the reliability of the JJ-based SHA-256 engine without significant complexity.

For the pipelined SHA-256 accelerator, all the stages in the pipeline are functionally identical. Furthermore, all the stages have a deterministic control path and data path. This can be leveraged to enable low-cost fault tolerance. We propose to add an extra pipeline stage and build bypass logic between consecutive pipeline stages such that if a fault is detected in a pipeline stage, that stage can be bypassed, as shown in Figure 10(b). The bypass logic and spare pipeline stage can also be used to detect the faulty pipeline stage.
A faulty stage can be detected by bypassing the stages one by one with a standard input and output pair until the right hash is produced. While we describe the solution with one spare pipeline stage, the same bypass network can be used to mitigate N faulty stages in the accelerator by using N spare pipeline stages.

The bypass logic is placed between all 128 stages, and it consists of four 32-bit 2:1 multiplexers, as shown in Figure 10(c). The multiplexers can bypass the faulty stage and re-route the signals to subsequent working stages. The multiplexers are essential for routing signals from one stage to another even in the absence of faults, as they are placed between two stages, as shown in Figure 10(b). A fault on any of the multiplexers results in a non-functional SHA engine. However, multiplexers cover only a small fraction of the total area, and the likelihood of a fault occurring on any of the multiplexers is an order of magnitude lower than for the other functional units. Thus, this design enables partial fault tolerance, as it can function correctly as long as faults do not occur on any of the multiplexer blocks.

To evaluate the effectiveness of the design, we perform binomial trials assuming independent and identically distributed (iid) errors. In the baseline, even a single fault can lead to system failure, whereas the sparing design builds in some fault tolerance. To improve the reliability even further, we propose a spare stage design that uses redundant multiplexers for the bypass circuitry. As shown in Figure 10(e), the redundant multiplexer can tolerate one fault on any of the four multiplexers by using an extra 8:1 mux. The design with the redundant mux can tolerate one fault anywhere. Figure 11 shows the probability of system failure for the baseline, stage-sparing, and stage-sparing with redundant muxes. The design with sparing and redundant mux is 5-6 orders of magnitude more reliable than the baseline.

Figure 11: Probability of system failure with respect to probability of gate-level fault due to flux trapping for (a) baseline with no redundant structures (b) design with a spare stage and bypass selection logic (c) spare stage and redundant muxes
[Plot: x-axis, probability of gate level fault (scale 1e-7); y-axis, probability of system failure; series: Unreliable Baseline, Design with Stage Sparing, Design with Redundant Mux.]
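The binomial reliability argument above can be sketched in closed form under the stated iid assumption. The code below is a simplified model, not the evaluation behind Figure 11: the gate counts per stage and per bypass unit are hypothetical placeholders, the extra gates of the 8:1 mux are ignored, and "tolerates one fault" is modeled as at most one faulty gate per bypass unit.

```python
import math


def p_block_ok(p_gate: float, n_gates: int) -> float:
    """Probability that none of n_gates is faulty, for iid gate faults."""
    return math.exp(n_gates * math.log1p(-p_gate))


def fail_baseline(p_gate: float, gates_per_stage: int, n_stages: int) -> float:
    """Baseline: any single faulty gate makes the engine fail."""
    return 1.0 - p_block_ok(p_gate, gates_per_stage * n_stages)


def _stages_ok_with_one_spare(p_gate: float, gates_per_stage: int, n_stages: int) -> float:
    """Probability that at most one of n_stages + 1 stages contains a fault."""
    q = p_block_ok(p_gate, gates_per_stage)
    total = n_stages + 1  # one spare stage
    return q ** total + total * (1.0 - q) * q ** (total - 1)


def fail_stage_sparing(p_gate, gates_per_stage, n_stages, mux_gates_per_unit):
    """Spare stage + plain bypass muxes: every mux gate must be fault-free."""
    mux_ok = p_block_ok(p_gate, mux_gates_per_unit * (n_stages + 1))
    return 1.0 - _stages_ok_with_one_spare(p_gate, gates_per_stage, n_stages) * mux_ok


def fail_redundant_mux(p_gate, gates_per_stage, n_stages, mux_gates_per_unit):
    """Spare stage + redundant muxes: each bypass unit survives one faulty gate."""
    m = mux_gates_per_unit
    unit_ok = p_block_ok(p_gate, m) + m * p_gate * p_block_ok(p_gate, m - 1)
    units_ok = unit_ok ** (n_stages + 1)
    return 1.0 - _stages_ok_with_one_spare(p_gate, gates_per_stage, n_stages) * units_ok


if __name__ == "__main__":
    # Hypothetical sizes: 25,000 gates per stage, 500 gates per bypass unit.
    for p in (1e-9, 1e-8, 1e-7):
        print(p,
              fail_baseline(p, 25_000, 128),
              fail_stage_sparing(p, 25_000, 128, 500),
              fail_redundant_mux(p, 25_000, 128, 500))
```

With these placeholder sizes the ordering of the three designs matches the qualitative trend described in the text; the absolute values depend entirely on the assumed gate counts and should not be read as reproducing the paper's numbers.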
The energy-efficiency and performance of a superconducting circuit are determined by the critical current (I_c). We can reduce the energy consumption by reducing I_c; however, this can cause certain devices to fail. Therefore, the critical current is set conservatively such that none of the devices fail. Recent studies suggest that the I_c distribution for future technology nodes may have a large spread between devices, leading to as much as a 5x difference between the average I_c and the worst-case I_c [31]. The heavy tail of the I_c distribution may force designers to pick I_c conservatively. However, we can leverage the proposed fault-tolerant design to tune I_c using a better-than-worst-case (BTWC) design philosophy, because the design can protect against a large-granularity failure using a spare pipeline stage. To perform the run-time tuning, the I_c value is lowered until a failure is observed. With the fault-tolerant design, a weak pipeline stage can be detected and bypassed. If a fault cannot be isolated, I_c is increased. The tuning thus finds the optimal I_c by isolating a weak or faulty pipeline stage and mitigating the fault. This can reduce I_c from 38 µA to 10 µA (based on the conservative I_c scaling model of Herr et al. [3]).

Figure 12 shows the energy-efficiency of JJ-based designs, all normalized to the baseline CMOS implementation of Goldstrike 1. For reference, we also show the published numbers for a commercial ASIC, Antminer-S9, which provides only a 3x improvement in energy efficiency. A basic design that simply translates the CMOS implementation of Goldstrike to JJ technology provides a 10x improvement. Redesigning it for technology-specific constraints (fanout, efficient communication) improves the energy efficiency to 12.4x. With fault tolerance enabled, the proposed fault-tolerant design with one spare pipeline stage has an overall energy efficiency of 12.2x. The additional JJs required for bypass logic lower the efficiency compared to the unreliable design. However, the fault-tolerant design enables lowering of the critical current (I_c) from 38 µA to 10 µA, increasing the energy efficiency to 46x (while having 1.2x the performance of the CMOS implementation).

Figure 12: Energy-efficiency of CMOS and JJ-based implementations. Our final design has 46x improvement over CMOS-based implementation. Note: All JJ-based evaluations include a cooling overhead of 300x.
[Plot groups: Superconducting (Unreliable), Superconducting (reliable).]
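The run-time tuning procedure described above (lower I_c until a failure appears, bypass a weak stage if possible, otherwise back off) can be sketched as a simple search loop. The model below is a toy: each stage is assumed to have a hypothetical minimum I_c below which it fails, and the engine is assumed to work whenever the number of failing stages does not exceed the number of spares. None of the names or numbers in the example come from the paper.

```python
from typing import Sequence


def engine_works(ic_ua: float, stage_min_ic_ua: Sequence[float], n_spares: int = 1) -> bool:
    """Toy check: the engine works if all weak stages can be bypassed with spares."""
    weak_stages = [s for s, need in enumerate(stage_min_ic_ua) if need > ic_ua]
    return len(weak_stages) <= n_spares


def tune_ic(ic_start_ua: float, ic_floor_ua: float, step_ua: float,
            stage_min_ic_ua: Sequence[float], n_spares: int = 1) -> float:
    """Lower I_c step by step; stop just before a fault that cannot be isolated."""
    ic = ic_start_ua
    while (ic - step_ua >= ic_floor_ua
           and engine_works(ic - step_ua, stage_min_ic_ua, n_spares)):
        ic -= step_ua
    return ic


if __name__ == "__main__":
    # Hypothetical chip: most stages work down to 10 uA, one outlier needs 30 uA,
    # another needs 12 uA.
    stage_min = [10.0] * 127 + [30.0, 12.0]
    print("no spare :", tune_ic(38.0, 5.0, 1.0, stage_min, n_spares=0))
    print("one spare:", tune_ic(38.0, 5.0, 1.0, stage_min, n_spares=1))
```

In this toy example a single outlier stage pins I_c at its own threshold when no spare is available, while one spare lets the tuning continue until the second-weakest stage fails. As a separate sanity check on the reported numbers: if power scales linearly with I_c (Equation 2) and everything else is unchanged, reducing I_c from 38 µA to 10 µA scales the 12.2x efficiency by 3.8, giving about 46x.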
To the best of our knowledge, this is one of the first papers to explore superconducting accelerators and evaluate their performance and power using application-level metrics. As this is an emerging technology, there is no publicly available methodology or workflow for evaluating the performance, power, and area of systems built using superconducting technology. Furthermore, standard cells and design rules in superconducting logic families are fundamentally different from those of CMOS technology. For example, logic cells in JJ technology have limited driving capacity, and to drive more than one cell, a buffer-like cell known as a Josephson Transmission Line (JTL) must be placed between two cells. This limits the direct use of standard CMOS tools for design space exploration in superconducting technology. To overcome this problem, we use open-source back-end design tools to incorporate design constraints specific to superconducting logic. In addition to the modified design tool, we use analytical models to calculate performance, power, and area for superconducting logic. Figure 13 provides an overview of the workflow of tools used in our evaluation.

Figure 13: Workflow for evaluating area, performance, and power of superconducting accelerators.
[Flow diagram blocks: RQL Cell Lib [12]; Design in HDL; Yosys Open Source Synthesis Tool [32]; Netlist; Rules for JTL; JJ Complexity; Performance, Power, Area.]
The area of a superconducting circuit is denoted by a term called JJ-complexity. The JJ-complexity is the number of JJs required to design a logic block. A logic block consists of logic gates and Josephson Transmission Lines (JTLs). As JJ-based logic gates have limited driving strength, JTLs are inserted to facilitate the desired fanout. In this paper, we use JJ-complexity as a key figure of merit, similar to prior superconducting system designs [4, 16]. We evaluate the system-level JJ-complexity by computing gate JJ-complexity and interconnect JJ-complexity.

Gate JJ-Complexity: To evaluate gate JJ-complexity, we use the RQL standard cell library and Yosys [11, 32], an open-source synthesis tool. Yosys enables us to derive the gate-level netlist using only RQL standard cells. Yosys uses ABC [33], which allows it to map a design’s gate-level representation to a target custom library (the RQL cell library in our analysis). We process the netlist to compute the gate JJ-complexity by determining the number of gates used of each type. Additionally, we use Synopsys Design Compiler, a standard CMOS synthesis tool, to generate the gate-level design, and post-process the design to optimize it using RQL-specific standard cells. The gate JJ-complexity obtained is the same from both techniques. Note that the lack of place-and-route tools and restricted access to foundry models force superconducting logic designers to use manual routing to compute JJ-complexity.
Standard Cell    JJ Complexity
AND-OR           2
A-NOT-B          2
XOR              4
NRDO             5
D-FlipFlop       5
JTL              2

Figure 14: (a) JJ Complexity of RQL Standard Cells (b) Josephson Transmission Line (JTL)
[Panel (b) schematic labels: flux bias, clock line, inductors (L), two JJs, input, output.]

Interconnect JJ Complexity: As RQL gates have limited driving strength, JTLs are used to drive gates. As shown in Figure 14(b), each JTL comprises 2 JJs. JTLs provide fan-out capacity similar to buffers in traditional CMOS circuits and limit clock skew and jitter. Due to limited driving strength, RQL gates require one JTL for every output load. We process the Yosys-generated netlist and determine the fanout for every input port, output port, and internal wire. To account for JTL overheads, we compute the number of JJs required using the following rules, based on [3]:

(1) A JTL is added after a series of five logic gates to suppress clock skew and jitter.
(2) A JTL is required per fanout (a gate can drive a JTL, and a JTL can drive a gate and a JTL).
(3) XOR gates need an extra JTL because they operate at the phase boundary (RQL uses a four-phase clock; two clock lines with a phase difference of π/2 provide two phases each [11]).

Our analysis (Figure 6) shows that most of the gates drive either 1 or 2 gates, and the percentage of gates that drive more than 2 gates is quite small (less than 1%). Given that approximately half the gates drive exactly 2 gates, the overhead of additional JTLs due to fanout is approximately 50% for our baseline implementation.

System JJ complexity: A full system design using superconducting technology requires JJs for implementing logic and for enabling signal routing and fanout. We derive the system JJ-complexity ($JJ_{system}$) by adding the gate JJ-complexity ($JJ_{gate}$) and the interconnect JJ-complexity ($JJ_{interconnect}$), as shown in Equation 1.

$JJ_{system} = JJ_{gate} + JJ_{interconnect}$    (1)

Table 5 shows the JJ-complexity of some commonly used logic blocks. For validation, we compare our method of evaluating JJ-complexity against published designs that use a foundry RQL standard cell library based on foundry models, and observe that our estimates are within 12% of the numbers reported in prior work [4, 16].
Table 5: Evaluations for proposed Design Methodology

Logic block           Estimated JJ Complexity   Reported JJ Complexity [4]   Percentage Error
32 bit RCA Adder
32 bit KSA Adder
Integer Multiplier
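The JJ-complexity bookkeeping described by Equation 1 and the JTL rules can be sketched as a small counting routine. This is an illustrative reading rather than the authors' tool: the netlist is assumed to be already reduced to (cell type, fanout) pairs, rule (2) is interpreted as one JTL per driven load, and the number of JTLs required by rule (1) is passed in as a parameter because it depends on path structure that this sketch does not model.

```python
from typing import Iterable, Tuple

# JJ counts per RQL standard cell, from Figure 14(a).
JJ_PER_CELL = {"AND-OR": 2, "A-NOT-B": 2, "XOR": 4, "NRDO": 5, "D-FlipFlop": 5}
JJ_PER_JTL = 2


def jj_complexity(gates: Iterable[Tuple[str, int]], skew_jtls: int = 0) -> Tuple[int, int, int]:
    """Return (JJ_gate, JJ_interconnect, JJ_system) for a list of (cell, fanout).

    skew_jtls is the number of JTLs inserted after every run of five logic
    gates (rule 1); it is supplied by the caller in this sketch.
    """
    gates = list(gates)
    jj_gate = sum(JJ_PER_CELL[cell] for cell, _ in gates)

    n_jtl = skew_jtls
    for cell, fanout in gates:
        n_jtl += fanout          # rule 2: one JTL per driven load (assumed reading)
        if cell == "XOR":
            n_jtl += 1           # rule 3: extra JTL at the phase boundary

    jj_interconnect = n_jtl * JJ_PER_JTL
    return jj_gate, jj_interconnect, jj_gate + jj_interconnect


if __name__ == "__main__":
    # Hypothetical two-gate netlist: an XOR driving two loads, an AND-OR driving one.
    print(jj_complexity([("XOR", 2), ("AND-OR", 1)]))
```

A real flow would extract the (cell, fanout) pairs from the synthesized netlist; the point here is only that the interconnect term can rival or exceed the gate term even for very small circuits.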
RQL delivers power to on-chip devices through inductive coupling to an AC transmission line. As a result, RQL circuits dissipate negligible static power. RQL uses reciprocal data encoding where “0” is represented by the absence of SFQ pulses. Therefore, the dynamic power dissipation in RQL circuits results only from digital “1”s, and digital “0”s do not dissipate power. The total power dissipated ($P_{dynamic}$) by an RQL circuit is given by Equation 2.

$P_{dynamic} = \,\cdot\, n \cdot f \cdot I_c \cdot \Phi_0 \cdot \alpha$    (2)

where n is the number of JJs, f is the frequency, $I_c$ is the critical current, $\Phi_0$ is a universal constant (the magnetic flux quantum), and α is the activity factor (the percentage of JJs switching to the “1” state). The power dissipated by the superconducting logic is directly proportional to the critical current, which depends on the device fabrication technology and foundry services. For our evaluations, we assume the critical current to be 38 µA. However, a conservative analysis of critical current reveals that it is possible to reduce it to 10 µA without substantial impact on the bit error rate [3]. We determine the activity factor (α) of a design by counting the number of “1”s in the value change dump (VCD) file of random simulations. We evaluate the total power consumption ($P_{total}$) by multiplying the power dissipated by the design at 4.2 K by a cooling overhead, as shown in Equation 3.

$P_{total} = \text{cooling factor} \cdot P_{dynamic}$    (3)

Note: For all accelerators implemented using the JJ technology, we include a cooling overhead of 300x in our energy-efficiency calculations.
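As a worked instance of Equation 3, and assuming that the Total Power row of Table 4 already includes the 300x cooling overhead, the power dissipated at 4.2 K by the all-RCA baseline can be recovered by dividing out the cooling factor:

```latex
P_{dynamic} = \frac{P_{total}}{\text{cooling factor}}
            = \frac{15.64\ \text{mW}}{300}
            \approx 52.1\ \mu\text{W}
```

Under the same assumption, the design with the 4-operand CSA and register optimization (22.26 mW total) dissipates about 74.2 µW on chip.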
In order to model performance, we count the number of JJs in the critical path of a design and multiply it by the switching time of a JJ. We assume a uniform JJ switching time of 2 ps [3] for all devices. Switching time can be improved by using larger feature size. However, for our analysis, we lack the design and layout tools to study such optimizations.
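The performance model above amounts to a single multiplication. The sketch below assumes, in addition, that the clock period of a pipeline stage equals its critical-path latency; the critical-path length used in the example is a hypothetical value, not a number from the paper.

```python
JJ_SWITCHING_TIME_PS = 2.0  # uniform JJ switching time assumed in the text [3]


def critical_path_latency_ps(jjs_on_critical_path: int) -> float:
    """Latency of the critical path: number of JJs times the JJ switching time."""
    return jjs_on_critical_path * JJ_SWITCHING_TIME_PS


def max_clock_ghz(jjs_on_critical_path: int) -> float:
    """Clock rate if one clock period must cover the whole critical path."""
    return 1000.0 / critical_path_latency_ps(jjs_on_critical_path)


if __name__ == "__main__":
    n = 100  # hypothetical number of JJs on the critical path
    print(critical_path_latency_ps(n), "ps,", max_clock_ghz(n), "GHz")
```

Shortening the critical path, for example by fusing additions with a CSA as discussed earlier, raises the achievable clock rate in direct proportion under this model.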
Superconducting circuits: A Josephson junction based processor was proposed as early as 1980 [34]. A number of circuits were demonstrated in RSFQ logic in the 1990s, including DSPs [35], microprocessor components [24, 36–39], mixed-signal devices [40–42], and floating-point units [43, 44]. However, due to static power dissipation challenges and high device counts per logic gate, RSFQ circuits faced scalability issues. With the introduction of the RQL family of logic gates, designers were able to mitigate the disadvantages of RSFQ. So far, several RQL circuits have been demonstrated, including shift registers, an 8-bit carry-lookahead adder, shift register yield vehicles, transmission driver systems, and serial data receiver systems [3, 12–15]. Dorojevets et al. present resource estimates for 32-bit and 64-bit integer and floating-point ALUs, on-chip storage elements, and bloom filters [4, 16, 17]. Holmes et al. analyze the feasibility of HPC systems with 1000 PFLOPs [6].

SHA Designs: SHA-256 optimizations include changing the computational platform from general-purpose processors to FPGAs [45–49] and ASICs [26, 27, 50, 51]. Hardware optimizations of SHA engines include the use of carry save adders and combinations of different types of adders [26, 27, 49, 52], pipeline designs [49, 52, 53, 54], delay balancing [26], and operation rescheduling [25, 26, 45, 46].

Reliable SHA cores: Prior fault-tolerant schemes for SHA hardware use triple modular redundancy [55], register protection using Hamming codes [56], and SHA cores with inbuilt self-checking mechanisms [57, 58]. These schemes assume uncorrelated errors and incur significant area and complexity.
In this paper, we evaluate the system-level performance and energy improvements for an accelerator built with Josephson junction technology. We focus on three JJ-technology challenges: low device density, limited fanout, and correlated faults due to flux trapping. To leverage the existing technology with limited device density, we focus on SHA-256 engines, which are commonly used in bitcoin-mining accelerators. This application has high computational intensity and a tiny memory footprint, and energy-efficiency is its key metric.

A direct translation of the CMOS design of a baseline [9] to a JJ design provides a 10x improvement in energy-efficiency (GH/J). We highlight the fan-out overhead in JJ technology and how it impacts the design choices for arithmetic units and pipeline design. We study a technology-aware design that improves the performance by 1.6x while boosting the energy efficiency to 12x over the CMOS baseline. We present a unique reliability challenge in JJ technology and propose a fault-tolerant design that can protect against large-granularity faults that occur due to this new failure mode. Moreover, we utilize this fault-tolerant design to enable a better-than-worst-case design that allows scaling of the critical current without sacrificing functionality, providing a 46x improvement in energy efficiency over the CMOS design. We introduce a methodology for estimating the area, performance, and power of accelerators built in superconducting logic. Such a workflow can help other researchers in exploring designs using this technology. While we evaluate SHA-256 as an example, the JJ technology is also applicable to other domains.
ACKNOWLEDGMENTS
We thank Srilatha Manne, Elnaz Ansari, and Zachary Myers for the technical discussions and feedback. This work was supported by a gift from Microsoft Research.
REFERENCES [1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates,S. Bhatia, N. Boden, A. Borchers, et al. , “In-datacenter performance analysis of atensor processing unit,” in
Computer Architecture (ISCA), 2017 ACM/IEEE 44thAnnual International Symposium on , pp. 1–12, IEEE, 2017.[2] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Al-kalay, M. Haselman, L. Adams, M. Ghandi, et al. , “A configurable cloud-scalednn processor for real-time ai,” in , IEEE, 2018.[3] Q. P. Herr, A. Y. Herr, O. T. Oberg, and A. G. Ioannidis, “Ultra-low-powersuperconductor logic,”
Journal of applied physics , vol. 109, no. 10, p. 103903,2011.[4] M. Dorojevets, Z. Chen, C. L. Ayala, and A. K. Kasperek, “Towards 32-bitenergy-efficient superconductor rql processors: the cell-level design and analysisof key processing and on-chip storage units,”
IEEE Transactions on AppliedSuperconductivity , vol. 25, no. 3, pp. 1–8, 2015.[5] A. Pedram, S. Richardson, M. Horowitz, S. Galal, and S. Kvatinsky, “Darkmemory and accelerator-rich system optimization in the dark silicon era,”
IEEEDesign & Test , vol. 34, no. 2, pp. 39–50, 2017.[6] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, “Energy efficient super-conducting computing power budgets and requirements,”
IEEE Transactions onApplied Superconductivity , vol. 23, no. 3, pp. 1701610–1701610, 2013.[7] M. L. Laboratory, “Beyond cmos superconducting digital circuits,”[8] S. K. Tolpygo, “Superconductor digital electronics: Scalability and energy effi-ciency issues,”
Low Temperature Physics , vol. 42, no. 5, pp. 361–379, 2016.
9] J. Barkatullah and T. Hanke, “Goldstrike 1: Cointerra’s first-generation cryp-tocurrency mining processor for bitcoin,”
IEEE micro , vol. 35, no. 2, pp. 68–76,2015.[10] A. de Vries, “Bitcoin’s growing energy problem,”
Joule , vol. 2, no. 5, pp. 801–805,2018.[11] O. T. Oberg,
Superconducting logic circuits operating with reciprocal magneticflux quanta . University of Maryland, College Park, 2011.[12] A. Y. Herr, Q. P. Herr, O. T. Oberg, O. Naaman, J. X. Przybysz, P. Borodulin,and S. B. Shauck, “An 8-bit carry look-ahead adder with 150 ps latency andsub-microwatt power dissipation at 10 ghz,”
Journal of Applied Physics , vol. 113,no. 3, p. 033911, 2013.[13] Q. P. Herr, J. Osborne, M. J. Stoutimore, H. Hearne, R. Selig, J. Vogel, E. Min,V. V. Talanov, and A. Y. Herr, “Reproducible operating margins on a 72 800-devicedigital superconducting chip,”
Superconductor Science and Technology , vol. 28,no. 12, p. 124003, 2015.[14] Q. P. Herr, E. Rudman, J. D. Egan, and V. V. Talanov, “Superconducting transmis-sion driver system,” May 24 2018. US Patent App. 15/356,049.[15] S. B. SHAUCK, “Reciprocal quantum logic (rql) serial data receiver system,”Sept. 25 2018. US Patent App. 10/083,148.[16] M. Dorojevets and Z. Chen, “Fast pipelined storage for high-performance energy-efficient computing with superconductor technology,” in
Emerging Technologiesfor a Smarter World (CEWIT), 2015 12th International Conference & Expo on ,pp. 1–6, IEEE, 2015.[17] M. Dorojevets, “Energy-efficient superconductor bloom filters for streaming datainspection,”
IEEE Transactions on Dependable and Secure Computing , 2018.[18] F. Wright and T. M. Conte, “Standards: Roadmapping computer technology trendsenlightens industry,”
Computer , no. 6, pp. 100–103, 2018.[19] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Darksilicon and the end of multicore scaling,” in
Computer Architecture (ISCA), 201138th Annual International Symposium on , pp. 365–376, IEEE, 2011.[20] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,”[21] K. J. O’Dwyer and D. Malone, “Bitcoin mining and its energy footprint,” 2014.[22] I. Magaki, M. Khazraee, L. V. Gutierrez, and M. B. Taylor, “Asic clouds: special-izing the datacenter,” in
ACM SIGARCH Computer Architecture News , vol. 44,pp. 178–190, IEEE Press, 2016.[23] M. B. Taylor, “The evolution of bitcoin hardware,”
Computer , vol. 50, no. 9,pp. 58–66, 2017.[24] M. Dorojevets, C. L. Ayala, N. Yoshikawa, and A. Fujimaki, “8-bit asynchronoussparse-tree superconductor rsfq arithmetic-logic unit with a rich set of operations,”
IEEE Transactions on Applied Superconductivity , vol. 23, no. 3, pp. 1700104–1700104, 2013.[25] K. K. Ting, S. C. Yuen, K.-H. Lee, and P. H. Leong, “An fpga based sha-256processor,” in
International Conference on Field Programmable Logic and Appli-cations , pp. 577–585, Springer, 2002.[26] L. Dadda, M. Macchetti, and J. Owen, “The design of a high speed asic unit forthe hash function sha-256 (384, 512),” in
Proceedings of the conference on Design,automation and test in Europe-Volume 3 , p. 30070, IEEE Computer Society, 2004.[27] L. Dadda, M. Macchetti, and J. Owen, “An asic design for a high speed imple-mentation of the hash function sha-256 (384, 512),” in
Proceedings of the 14thACM Great Lakes symposium on VLSI , pp. 421–425, ACM, 2004.[28] J. J. P. Eckert and J. W. Mauchly, “Memory system,” Feb. 24 1953. US Patent2,629,827.[29] I. L. Auerbach, J. P. Eckert, R. Shaw, and C. Sheppard, “Mercury delay linememory using a pulse rate of several megacycles,”
Proceedings of the IRE , vol. 37,no. 8, pp. 855–861, 1949.[30] M. Vilim, H. Duwe, and R. Kumar, “Approximate bitcoin mining,” in
DesignAutomation Conference (DAC), 2016 53nd ACM/EDAC/IEEE , pp. 1–6, IEEE,2016.[31] D. S. Holmes and J. McHenry, “Non-normal critical current distributions injosephson junctions with aluminum oxide barriers,”
IEEE Transactions on AppliedSuperconductivity , vol. 27, no. 4, pp. 1–5, 2017.[32] C. Wolf, “Yosys open synthesis suite,” 2016.[33] A. Mishchenko et al. , “Abc: A system for sequential synthesis and verification,” , p. 17, 2007.[34] W. Anacker, “Josephson computer technology: An ibm research project,”
IBMJournal of research and development , vol. 24, no. 2, pp. 107–112, 1980.[35] D. Gupta, T. V. Filippov, A. F. Kirichenko, D. E. Kirichenko, I. V. Vernik, A. Sahu,S. Sarwana, P. Shevchenko, A. Talalaevskii, and O. A. Mukhanov, “Digitalchannelizing radio frequency receiver,”
IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 430–437, 2007.
[36] O. A. Mukhanov, “Rapid single flux quantum (rsfq) shift register family,”
IEEE Transactions on Applied Superconductivity, vol. 3, no. 1, pp. 2578–2581, 1993.
[37] O. A. Mukhanov, “Rsfq 1024-bit shift register for acquisition memory,”
IEEE Transactions on Applied Superconductivity, vol. 3, no. 4, pp. 3102–3113, 1993.
[38] T. Filippov, M. Dorojevets, A. Sahu, A. Kirichenko, C. Ayala, and O. Mukhanov, “8-bit asynchronous wave-pipelined rsfq arithmetic-logic unit,”
IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 847–851, 2011.
[39] Y. Yamanashi, M. Tanaka, A. Akimoto, H. Park, Y. Kamiya, N. Irie, N. Yoshikawa, A. Fujimaki, H. Terai, and Y. Hashimoto, “Design and implementation of a pipelined bit-serial sfq microprocessor, core 1β,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 474–477, 2007.
[40] S. V. Rylov and R. P. Robertazzi, “Superconducting high-resolution a/d converter based on phase modulation and multichannel timing arbitration,”
IEEE Transactions on Applied Superconductivity, vol. 5, no. 2, pp. 2260–2263, 1995.
[41] A. Inamdar, S. Rylov, A. Talalaevskii, A. Sahu, S. Sarwana, D. E. Kirichenko, I. V. Vernik, T. V. Filippov, and D. Gupta, “Progress in design of improved high dynamic range analog-to-digital converters,”
IEEE Transactions on Applied Superconductivity, vol. 19, no. 3, pp. 670–675, 2009.
[42] A. Kirichenko, S. Sarwana, D. Gupta, I. Rochwarger, and O. Mukhanov, “Multi-channel time digitizing systems,”
IEEE Transactions on Applied Superconductivity, vol. 13, no. 2, pp. 454–458, 2003.
[43] O. A. Mukhanov and A. F. Kirichenko, “Implementation of a fft radix 2 butterfly using serial rsfq multiplier-adders,”
IEEE Transactions on Applied Superconductivity, vol. 5, no. 2, pp. 2461–2464, 1995.
[44] T. Kainuma, Y. Shimamura, F. Miyaoka, Y. Yamanashi, N. Yoshikawa, A. Fujimaki, K. Takagi, N. Takagi, and S. Nagasawa, “Design and implementation of component circuits of an sfq half-precision floating-point adder using 10-ka/cm² nb process,”
IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 827–830, 2011.
[45] R. Chaves, G. Kuzmanov, L. Sousa, and S. Vassiliadis, “Cost-efficient sha hardware accelerators,”
IEEE transactions on very large scale integration (VLSI) Systems, vol. 16, no. 8, pp. 999–1008, 2008.
[46] R. Chaves, G. Kuzmanov, L. Sousa, and S. Vassiliadis, “Improving sha-2 hardware implementations,” in
International Workshop on Cryptographic Hardware and Embedded Systems, pp. 298–310, Springer, 2006.
[47] O. Esuruoso, “High speed fpga implementation of cryptographic hash function,” 2011.
[48] I. Ahmad and A. S. Das, “Hardware implementation analysis of sha-256 and sha-512 algorithms on fpgas,”
Computers & Electrical Engineering, vol. 31, no. 6, pp. 345–360, 2005.
[49] R. P. McEvoy, F. M. Crowe, C. C. Murphy, and W. P. Marnane, “Optimisation of the sha-2 family of hash functions on fpgas,” in
Emerging VLSI Technologies and Architectures, 2006. IEEE Computer Society Annual Symposium on, pp. 6–pp, IEEE, 2006.
[50] M. Kim, J. Ryou, and S. Jun, “Efficient hardware architecture of sha-256 algorithm for trusted mobile computing,” in
International Conference on Information Security and Cryptology, pp. 240–252, Springer, 2008.
[51] A. Satoh and T. Inoue, “Asic-hardware-focused comparison for hash functions md5, ripemd-160, and shs,”
INTEGRATION, the VLSI journal, vol. 40, no. 1, pp. 3–10, 2007.
[52] R. Lien, T. Grembowski, and K. Gaj, “A 1 gbit/s partially unrolled architecture of hash functions sha-1 and sha-512,” in
Cryptographers Track at the RSA Conference, pp. 324–338, Springer, 2004.
[53] M. Macchetti and L. Dadda, “Quasi-pipelined hash circuits,” in
Computer Arithmetic, 2005. ARITH-17 2005. 17th IEEE Symposium on, pp. 222–229, IEEE, 2005.
[54] F. Crowe, A. Daly, T. Kerins, and W. Marnane, “Single-chip fpga implementation of a cryptographic co-processor,” in
Field-Programmable Technology, 2004. Proceedings. 2004 IEEE International Conference on, pp. 279–285, IEEE, 2004.
[55] M. Juliato, C. Gebotys, and R. Elbaz, “Efficient fault tolerant sha-2 hash functions for space applications,” in
Aerospace conference, 2009 IEEE, pp. 1–16, IEEE, 2009.
[56] M. Juliato and C. Gebotys, “Seu-resistant sha-256 design for security in satellites,” in
Signal Processing for Space Communications, 2008. SPSC 2008. 10th International Workshop on, pp. 1–7, IEEE, 2008.
[57] H. E. Michail, A. Kotsiolis, A. Kakarountas, G. Athanasiou, and C. Goutis, “Hardware implementation of the totally self-checking sha-256 hash core,” in
EUROCON 2015-International Conference on Computer as a Tool (EUROCON), IEEE, pp. 1–5, IEEE, 2015.
[58] H. E. Michail, G. S. Athanasiou, G. Theodoridis, A. Gregoriades, and C. E. Goutis, “Design and implementation of totally-self checking sha-1 and sha-256 hash functions architectures,”
Microprocessors and Microsystems, vol. 45, pp. 227–240, 2016.