Orchestrated Trios: Compiling for Efficient Communication in Quantum Programs with 3-Qubit Gates
Casey Duckering, Jonathan M. Baker, Andrew Litteken, Frederic T. Chong
Casey Duckering
University of Chicago, Chicago, Illinois, USA
Jonathan M. Baker
University of Chicago, Chicago, Illinois, USA
Andrew Litteken
University of Chicago, Chicago, Illinois, USA
Frederic T. Chong
University of Chicago, Chicago, Illinois, USA
ABSTRACT
Current quantum computers are especially error prone and require high levels of optimization to reduce operation counts and maximize the probability the compiled program will succeed. These computers only support operations decomposed into one- and two-qubit gates and only two-qubit gates between physically connected pairs of qubits. Typical compilers first decompose operations, then route data to connected qubits. We propose a new compiler structure, Orchestrated Trios, that first decomposes to the three-qubit Toffoli, routes the inputs of the higher-level Toffoli operations to groups of nearby qubits, then finishes decomposition to hardware-supported gates. This significantly reduces communication overhead by giving the routing pass access to the higher-level structure of the circuit instead of discarding it. A second benefit is the ability to now select an architecture-tuned Toffoli decomposition such as the 8-CNOT Toffoli for the specific hardware qubits now known after the routing pass. We perform real experiments on IBM Johannesburg showing an average 35% decrease in two-qubit gate count and 23% increase in success rate of a single Toffoli over Qiskit. We additionally compile many near-term benchmark algorithms showing an average 344% increase in (or 4.44x) simulated success rate on the Johannesburg architecture and compare with other architecture types.
CCS CONCEPTS
• Computer systems organization → Quantum computing; • Software and its engineering → Compilers.
KEYWORDS
quantum computing, NISQ, compiler, Toffoli
ACM Reference Format:
Casey Duckering, Jonathan M. Baker, Andrew Litteken, and Frederic T. Chong. 2021. Orchestrated Trios: Compiling for Efficient Communication in Quantum Programs with 3-Qubit Gates. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), April 19–23, 2021, Virtual, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3445814.3446718
ASPLOS ’21, April 19–23, 2021, Virtual, USA. © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-8317-2/21/04...$15.00. https://doi.org/10.1145/3445814.3446718
[Figure 1 panels: (a) Expensive Qiskit routing; (b) Efficient Trios routing]
Figure 1: Example routing from Qiskit (a) vs. Trios (b) for a single Toffoli operation. Circles represent qubits and lines indicate two qubits are connected. Input qubits are highlighted in red. SWAP arrows are labeled by timestep. The routed locations for Trios routing are highlighted in green while Qiskit moves them several times. Qiskit adds 16 SWAPs (=48 CNOTs), some during the Toffoli, while Trios adds only 7 SWAPs (=21 CNOTs) all before the Toffoli. Performing multiple passes of decomposition allows direct routing and enables this huge reduction in communication, increasing the probability of program success.
© The authors 2021. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), https://doi.org/10.1145/3445814.3446718.
In recent years, quantum hardware has improved dramatically in terms of number of accessible quantum bits (qubits), device error rates, and qubit lifetimes. However, we are still years away from obtaining fully error corrected devices which are required to run important algorithms like Grover’s [15] and Shor’s [32]. In the current Noisy Intermediate-Scale Quantum (NISQ) [28] era, error rates on current devices are still prohibitive despite recent substantial improvements, requiring programs to be highly optimized to have a good chance at succeeding.
Quantum program compilation involves many passes of transformations and optimizations similar in many ways to classical compilers. Some optimizations occur at the abstract circuit level, independent of the underlying hardware, such as gate cancellation [26]. One of the first steps usually taken is to convert an input program into a gate set (ISA) supported by the target hardware. For example, on IBM devices, gates are typically rewritten using only gates in the set {u1, u2, u3, cx} [18] (single-qubit gates and the common CNOT gate described later). One critical limitation of many currently available architectures is the inability to execute more complex multi-qubit operations, like the Toffoli, directly; instead, these gates must be decomposed into the supported one- and two-qubit gates. Furthermore, many current superconducting architectures only support two-qubit operations on adjacent hardware qubits wired together with a coupler.
This requires the insertion of additional operations called SWAPs to move the data onto adjacent (and connected) qubits.
The process of transforming an optimized and decomposed program to the desired target is typically broken down into three distinct steps: decomposing the program into basic gates, mapping the logical qubits of a program to hardware qubits and routing interacting qubits so that they are adjacent on hardware when they interact, and scheduling operations in order to minimize total program run time (depth) or to minimize errors due to crosstalk [25]. Each of these steps is critical to the success of the input program. A well-mapped and well-routed program will reduce the total number of communication operations added and subsequently reduce the compiled program’s depth, both of which will increase the chance of success. Conventionally, these three steps occur sequentially. By doing so, current strategies are unable to account for structure in the input program, resulting in inefficient routing of qubits. An optimal compiler could find the best routing despite the lack of structure but at the cost of much slower compilation. Consider the SWAP paths inserted by IBM’s Qiskit compiler for a single Toffoli compiled to IBM’s Johannesburg device in Figure 1a. This baseline strategy adds a large number of unnecessary SWAPs as it individually routes each CNOT composing the Toffoli, dramatically reducing the probability of successful execution.
Our approach, Orchestrated Trios (Trios), decomposes and routes qubits in multiple stages, as seen in Figure 2b. For example, we first decompose an input program to one-, two-, and three-qubit gates (i.e. Toffoli gates are not decomposed) and route as before, except that for three-qubit gates we route all three qubits to a common location with minimal SWAPs. This new program can then undergo a second round of decomposition to produce a circuit containing only hardware-permitted one- and two-qubit gates.
The second round may use the now known mapping (locations of data qubits on the device) to generate fine-tuned decompositions for the architecture.
This layered approach has a major advantage over current routing techniques: we are better able to capture program structure by inspecting intermediate complex operations for routing. This better informs how qubits should be moved around the device during program execution. In Figure 1, the Trios strategy reduces the number of SWAPs added to 7 (21 CNOTs): fewer than half of the 16 SWAPs (48 CNOTs) added by Qiskit. This was an extreme example we selected to present the issue, not an average case.
[Figure 2 panels: (a) Conventional compilation: Input program → Unroll + Decompose → Circuit of 1- and 2-qubit gates (between any qubit pairs) → Map and Route → Circuit of 1- and 2-qubit gates (between connected qubits) → Schedule → Executable circuit. (b) Trios compilation: Input program → Unroll + Decompose to Toffoli → Circuit of 3-qubit Toffoli and other 1- and 2-qubit gates → Map and Route → Circuit of Toffoli gates between nearby qubits → Mapping-Aware Decompose → Circuit of 1- and 2-qubit gates (between connected qubits) → Schedule → Executable circuit.]
Figure 2: (a) Typical compilation passes used by Qiskit (simplified). (b) Trios compilation passes.
We specifically propose a two-pass approach to circuit decomposition. We will focus on superconducting hardware systems like IBM’s cloud accessible devices, but our strategy can easily be adapted to other systems. An overview of our compilation structure is found in Figure 2b. This strategy has a substantial benefit on the overall success rate of programs. We demonstrate these improvements by executing Toffoli gates on a real IBM quantum computer and estimating success probability of a suite of benchmarks via simulation.
Our contributions are as follows:
• A new compiler structure, Trios, with two passes for decomposition with a modified routing pass in between which greatly improves qubit routing.
• A simple method for architecture-tuned Toffoli decompositions during the second decompose pass that allows for a new kind of location-aware optimization.
• On Toffoli-only experiments, Trios reduces the total number of gates by 35% geomean (geometric mean) resulting in 23% geomean increase in success rate when run on real IBM hardware as compared to Qiskit.
• On near-term algorithms shown in Figure 11 (4 to 20 qubit benchmarks), Trios reduces total gate count by 37% geomean resulting in 344% geomean increase in (or 4.44x) simulated success rate on IBM Johannesburg with noise rates of near-future hardware as compared to programs compiled without Trios. A sensitivity analysis over four architecture types shows the benefit ranges from 133% to 3020% increase in success rate.
The most basic object in quantum computing is the quantum bit (qubit). Unlike a classical bit which is either 0 or 1, the qubit has two basis states $|0\rangle$ and $|1\rangle$ and can exist as a linear superposition over these two states, i.e. for a quantum state $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ with $\alpha, \beta \in \mathbb{C}$ and $\|\alpha\|^2 + \|\beta\|^2 = 1$. In general, a quantum system consisting of $n$ qubits can exist in a linear superposition of $2^n$ basis states in contrast to a classical system of $n$ bits which can exist as exactly a single one of these states. An important feature which gives quantum computing its power is the ability to entangle qubits via two-qubit operations like the CNOT. This, along with quantum interference between the complex amplitudes, allows quantum programs to solve certain problems faster than classical computers.
While a qubit system can exist in these superpositions during computation, at the end of the computation, the qubits are measured producing a classical binary outcome. The probability of each outcome depends on the amplitude of each basis state (the values of $\alpha, \beta, \gamma, \ldots$). Consequently, since the outcome of a quantum program is a classical bitstring and because quantum systems are inherently noisy, programs are usually run thousands of times to obtain a distribution over possible answers. A comprehensive background can be found in [27].
Quantum programs are typically represented as a circuit which, like a classical program, is an ordered list of instructions. Here the instructions are quantum logic gates applied to qubits. The input circuit may not be expressed in the instruction set supported by the underlying hardware or it might even be structured as hierarchical modules.
Quantum circuits have a single line for each qubit, with time flowing from left to right. Gates in a quantum circuit have the same number of inputs and outputs and gates on disjoint sets of lines can be executed in parallel. Single-qubit gates are represented as a box labeled with the indicated operation and controlled operations, like the CNOT and Toffoli, have one or two controls respectively indicated by • and target given by ⊕.
Currently available superconducting quantum hardware, like that of IBM and Rigetti, only supports one-qubit gates and two-qubit gates on specific pairs.
Therefore, more complex instructions must be decomposed into multiple simpler, supported operations. For example, many quantum algorithms and subroutines make use of the Toffoli gate, a three-input gate which performs the logical AND between two control bits and writes the output onto the target bit. This gate cannot be executed directly on available hardware and instead is decomposed into an equivalent sequence of one- and two-qubit operations. Two such popular decompositions are given in Figures 3 and 4. There are two key distinctions in these decompositions illustrating a more general trade-off. The first [27] is the most popular decomposition using only 6 CNOT gates but requires CNOTs between all three pairs of qubits. This would require inserted SWAPs or a device connectivity containing a triangle. The second [31] uses a total of 8 CNOT gates and requires all three inputs be only linearly connected (only two of the three qubit pairs are required to be connected). While the first is apparently more efficient, this is not true if the connectivity of the underlying hardware does not directly support it. It is more efficient to use the 8-CNOT version than use the 6-CNOT version with added SWAPs.
Figure 3: A 6-CNOT decomposition of the Toffoli gate.
Figure 4: An 8-CNOT decomposition of the Toffoli gate.
For superconducting qubits, current quantum computers support gates only between adjacent hardware qubits. In order to use qubits which are currently mapped far apart on the hardware, extra SWAP operations must be inserted; each of these SWAPs is usually decomposed as a series of 3 CNOT gates (equivalent to a classical in-place memory swap using 3 alternating XORs). In the case of the 6-CNOT Toffoli decomposition above, when mapped to a device with linear or square grid connectivity, no triangles exist so extra SWAPs will need to be inserted, resulting in a greater total number of CNOTs due to the mismatch with hardware details.
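The SWAP accounting above can be checked with a short sketch. The Python below is illustrative only and is not taken from the paper's source code: the helper names `cnot`, `swap_via_cnots`, and `toffoli_cnot_cost` are hypothetical, and the cost function assumes that every missing connection is repaired purely by SWAP insertion at 3 CNOTs per SWAP. Because CNOT and SWAP only permute computational basis states, agreement on every basis state is enough to show that the three-CNOT sequence equals a SWAP.

```python
from itertools import product


def cnot(bits, control, target):
    """Action of a CNOT on a computational basis state: target ^= control."""
    out = list(bits)
    out[target] ^= out[control]
    return tuple(out)


def swap_via_cnots(bits, a, b):
    """Three alternating CNOTs, the analog of an in-place XOR swap."""
    bits = cnot(bits, a, b)
    bits = cnot(bits, b, a)
    return cnot(bits, a, b)


def toffoli_cnot_cost(base_cnots, extra_swaps):
    """CNOT count of a decomposition plus SWAPs (assumed 3 CNOTs each)."""
    return base_cnots + 3 * extra_swaps


# The three-CNOT sequence exchanges the two bits for every basis state.
for bits in product((0, 1), repeat=2):
    assert swap_via_cnots(bits, 0, 1) == (bits[1], bits[0])

# On a line, the 6-CNOT form needs at least one SWAP under the assumption
# above, so it already costs more than the 8-CNOT form with no SWAPs.
six_on_line = toffoli_cnot_cost(6, extra_swaps=1)
eight_on_line = toffoli_cnot_cost(8, extra_swaps=0)
print(six_on_line, eight_on_line)  # prints: 9 8
```

A single added SWAP is enough to make the 6-CNOT form the more expensive choice, which is the trade-off the two decompositions illustrate.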
In this paper we focus primarily on currently available superconducting quantum devices. This type of hardware is the primary focus of many industry players like IBM, Rigetti, and Google [1, 18, 34]. We show some representative topologies for superconducting devices in Figure 5a, b, and d. For completeness, we include a clustered device shown in Figure 5c representative of a QCCD ion trap device such as [22]. These systems exhibit all of the properties previously discussed. They have a small universal supported gate set which all programs must be transformed into and only support local two-qubit operations. The connectivity of these devices is given as a coupling graph specifying which pairs of qubits can execute CNOTs.
[Figure 5 panels: (a) IBM Johannesburg; (b) 2D Grid; (c) 4, 5 qubit fully connected clusters; (d) Linear]
Figure 5: Example topologies of near-term quantum devices. Orange (a): IBM Johannesburg. Yellow (b): 2D Grid. Purple (c): four groups of five fully connected clusters. Green (d): Linear. Our real experiments run on Johannesburg and our simulations explore all of these topologies. Colors correspond with the bars in Figures 9, 10, 11.
Furthermore, these systems are subject to a wide variety of noise which causes programs to fail. Some noise is due to manufacturing imperfections or calibration error. Some is inherent to quantum program execution resulting from the imperfect physical isolation of the qubits from the environment required to manipulate the quantum state [20]. In IBM machines, the experimental devices of this work, single-qubit gate errors are small, occurring on average 1 in 2000 operations. CNOT gate errors are more significant, occurring on roughly 1 in 100 gates. Measurement error is also significant, with errors on the same order of magnitude as CNOT gates. Finally, qubit lifetimes (coherence times) are relatively short, allowing on the order of 50 CNOT gate durations before the qubit state is lost [2] (but gates can often run in parallel while imposing additional crosstalk error). Therefore, quantum compilation is essential to reduce both of these sources of error: add as few extra gates as possible and minimize total execution time.
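To give a feel for why extra gates matter at these error rates, the following Python sketch estimates the probability that no gate error occurs. It is a rough illustration, not the paper's simulator: the function name `no_gate_error_probability` is hypothetical, gate errors are assumed independent, and measurement, coherence, and crosstalk errors are ignored. The default rates are the approximate figures quoted above (1 in 100 for CNOTs, 1 in 2000 for single-qubit gates), and the CNOT counts are the routing overheads from Figure 1.

```python
def no_gate_error_probability(n_cnot, n_single=0,
                              p_cnot=1 / 100, p_single=1 / 2000):
    """Probability that no gate fails, assuming independent gate errors."""
    return (1 - p_cnot) ** n_cnot * (1 - p_single) ** n_single


# CNOTs added by routing in Figure 1:
# 16 SWAPs (48 CNOTs) for Qiskit vs. 7 SWAPs (21 CNOTs) for Trios.
qiskit_routing = no_gate_error_probability(48)
trios_routing = no_gate_error_probability(21)
print(f"{qiskit_routing:.3f} {trios_routing:.3f}")  # prints: 0.617 0.810
```

Under these assumptions, the routing overhead alone lowers the chance of an error-free run to roughly 62% in the Qiskit case versus roughly 81% in the Trios case, before the Toffoli itself is counted.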
In the NISQ era, quantum programs are highly optimized in order to reduce the effect of errors and maximize the probability the correct answer is observed. Similar to many classical programs, compilation uses a pass structure, where a set of transformations and optimizations are applied in a fixed order resulting in the compilation of an input quantum program to an executable for the target hardware [19, 26]. For the most part, these optimizations take place at the circuit-gate level. Some optimizations are hardware independent, for example, reducing total number of gates via commutativity-aware gate cancellation or find and replace with circuit identities. Other passes are focused on decomposing gates into the hardware’s ISA [4, 21, 30].
One of the most important parts of this compilation process is mapping and routing the optimized program to one executable on the target hardware, typically done post-decomposition. Quantum mechanics imposes constraints on these steps beyond those of classical compilation or logic synthesis. By the no-cloning theorem, quantum states cannot be copied, only entangled, which prevents fan-out or fan-in. Instead, the data must be routed sequentially (i.e. swapped with SWAP gates) to each place it is needed.
Compilation involves three main steps. First, mapping program qubits to hardware qubits in order to minimize the total distance between qubits that will need to be close by in the future [23, 37, 38]. Second, routing pairs of CNOT inputs to be adjacent by inserting SWAPs [10, 17]. Finally, scheduling operations to minimize total execution time [16, 25]. In general, the compilation problem is computationally hard and while some attempts at optimal solutions have been pursued [33, 36, 40] the dominant approach is heuristics. In this work we focus on two pieces of this compilation problem: decomposition and routing.
IBM’s Qiskit compiler, the standard for compiling programs to execute on an IBM device, has a default sequence of passes.
First, all high level optimization and analysis passes are performed and all gates are unrolled and decomposed to the target gate set. Then single passes of mapping, routing, and scheduling are performed [3].
When evaluating compiler methods, we use a few metrics to compare our results. Our primary metric is program success rate, the fraction of circuit executions that result in the correct output. Others use fidelity, which can stand in for success rate when evaluating sub-circuits where the output is not measured. When executing a quantum algorithm, the corresponding quantum circuit is typically executed thousands of times to gather output statistics or identify the error-free result.
Program success rate is highly dependent on the noise characteristics of the quantum computer the program runs on. The rates of these device errors can fluctuate day-to-day so we also use the simpler metric of two-qubit gate count. The number of two-qubit operations in the final compiled circuit is inversely correlated with the success rate because they are usually the largest source of noise.
Simulating general quantum systems is exponentially expensive in the size of the system and therefore it is difficult to realistically model all of the errors during the execution of a quantum program. We use a simplified model for simulation to predict, specifically obtain a close upper bound on, the success rate of a program with specified gate error rates and qubit coherence times. In our simplified model, we compute the probability of a program succeeding as the probability that no gate errors occur, $(p_{gate})^{n_{gates}}$, times the probability that no coherence errors occur, $p_{coherence}$, where the latter is computed as $e^{-(\Delta/T_1 + \Delta/T_2)}$, where $\Delta$ is the total program duration and $T_1$ and $T_2$ are the relaxation and dephasing times, collectively decoherence.
Current error rates, while rapidly improving, are still insufficient to obtain high probabilities of success, making it difficult to compare our mid-size benchmarks that are large enough to need many SWAPs. For our simulations we use error rates 20x improved over current IBM Johannesburg error rates to obtain reasonable success rates and we study sensitivity to this choice later.
In this section we motivate the need for a split decomposition pass with routing in between. We look closely at the Qiskit compiler which does not effectively account for the structure in programs. It often produces circuits with an excessive number of SWAPs, suggesting room for improvement.
The default compilation framework in Qiskit used to transform input circuits to be executed on their hardware ensures a fully decomposed circuit before mapping, routing, and scheduling occur. As a simple example, consider three qubits placed fairly distant on IBM’s Johannesburg device on which we need to execute a Toffoli gate, as in Figure 1a. Qiskit decomposes this Toffoli as in Figure 3 with 6 CNOTs.
Each CNOT acts on distant qubits so the many SWAPs inserted for all 6 CNOTs get expensive quickly. When routing, we first SWAP the first interacting pair together (usually by adding SWAPs from control to target or the reverse, but a meet-in-the-middle strategy is also possible) and the qubit mapping is updated. The next CNOT is also distant so we add SWAPs to move them together and there is an even chance that the SWAPs for the second CNOT separate the two qubits that were just brought together.
Ideally, we move the third qubit to the already adjacent pair, but Qiskit cannot recognize this situation and could just as well move the other way. This is clearly sub-optimal and could continue on for the other four CNOTs. Even in the case where it makes the correct decision to move the distant third qubit, there are problems. Because every pair of qubits needs to interact, we may need single additional SWAPs as the qubits compete to be neighbors. This causes the 6-CNOT Toffoli decomposition to use many more than 6 CNOTs when there is not a triangle in the qubit connectivity graph. The core idea is that the routing strategy fails to take advantage of two things. First, it has effectively forgotten the desired operation is a Toffoli which will require all three qubits be adjacent and second that a more efficient Toffoli decomposition could be chosen which is more suitable for the underlying device architecture. In the example, inefficient compilation adds a total of 16 SWAPs, or 48 CNOTs.
Some approaches in the past have attempted to solve the first of these problems, for example by using lookahead when choosing routing strategies [7, 39] and while this helps to treat the symptoms of pre-decomposing all operations it does not remedy the underlying problem.
In this section we describe our proposed compilation structure compared to the conventional one as outlined in Figure 2. Specifically, we focus on improving the routing and decomposition stages of compilation. Previously, we identified a key problem in current methods: decomposing the program to one- and two-qubit gates up front hinders the ability of heuristic-based compilers to effectively minimize the communication cost, i.e. the number of SWAPs added, and eliminates the possibility of location-aware decompositions.
We propose a new pass structure. Rather than performing a single round of decomposition and routing, we propose a split approach. Any program processing prior to decomposition stays the same. The decomposition pass is then divided so the majority of decomposition occurs next but any Toffoli gates are left as-is before moving on to mapping and routing.
The mapping and routing passes come next like normal but must be modified slightly to handle three-qubit gates. The mapper can simply treat the non-decomposed Toffoli as it would the equivalent 6 CNOTs for the purposes of determining which qubits most need to be placed nearby. We then do the modified routing pass, moving groups of qubits together instead of only pairs where all or all-but-one of the group are moved into a single neighborhood via SWAPs. This greatly improves the effectiveness of the routing heuristics when applied to this modified routing pass. There are some subtleties when coordinating the routing of multiple qubits to the same place to ensure the paths don’t overlap. For the purposes of our evaluations we do the following but many similar heuristic strategies are possible.
Taking the next operation to apply, we first find the shortest paths (using any shortest path algorithm on a graph) between all the pairs of qubits. We choose the qubit with the shortest sum of paths to the other two qubits as the destination. SWAPs following these two paths are then inserted into the circuit.
The two shortest paths are checked for overlap. If the ending points overlap, the second is only routed to the penultimate hardware location along the swap path and the first becomes the middle qubit adjacent to both others. This can save one valuable SWAP but doesn’t affect the correctness. Once they are adjacent, the Toffoli gate is now on adjacent qubits and routing can continue to the next operation.
Finally, the second decomposition pass is run. This is different from normal decomposition as there are only Toffoli gates to decompose and they are already mapped to neighboring qubits. We could use the default 6-CNOT decomposition and still get the above benefit of improved routing but now that we have more information, this can be exploited to further reduce SWAPs due to a mismatch between the decomposition and the hardware connectivity. If all three pairs of qubits are connected, then the 6-CNOT Toffoli of Figure 3 is best, otherwise use the 8-CNOT Toffoli of Figure 4, ensuring the middle qubit is used for the middle of the decomposition (any of the three qubits can be the target by simply moving the two H gates to that qubit).
When routing complex operations like the Toffoli, we recognize the underlying hardware does not usually support triangles in the connectivity graph but linear connectivity is sufficient for a decent decomposition. Since we are creating operations on three qubits, the qubits must be routed into a valid linear connectivity. That is, a configuration where each qubit is connected with at least one of the other qubits.
This method can be easily extended to be noise-aware like previous work [23, 37] by using a noise-aware mapper with the simple modification described earlier where the path-finding graph has weighted edges with the −log value of the CNOT success rate. The path distance represents the −log probability of success of that particular path where lower values indicate a higher success rate and the shortest path can be found just as before and the routing steps are unchanged. Any routing strategy designed for one- and two-qubit gates can be modified to work for one-, two-, and three-qubit gates and used as the first routing step of Trios.
In programs where there are no three-qubit gates as in the typical NISQ benchmark, Bernstein-Vazirani [9], which is specified directly as CNOT gates, our strategy will have no effect. Many benchmarks, however, are written using Toffoli gates because they are the quantum analog of the AND gate ubiquitous in arithmetic and other common subroutines.
Trios can naturally be extended to any multi-qubit operation of three or more qubits but this introduces the challenges of simultaneously routing many qubits and of designing decompositions that are efficient with whichever grouping the simultaneous router can achieve. It is not obvious how to route more than three qubits into a line or other desired shape. As many NISQ benchmarks are not typically written with more complex structures and usually phrase them in terms of one-, two-, and three-qubit gates, this extension may only be desirable for larger-scale quantum computing.
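The routing heuristic and the mapping-aware decomposition choice described in this section can be sketched in a few dozen lines. The Python below is one possible reading of the prose, not the authors' implementation: all names (`shortest_path`, `apply_swap`, `route_trio`, `choose_decomposition`) are hypothetical, the coupling graph is assumed to be a connected, unweighted adjacency dictionary, breadth-first search stands in for "any shortest path algorithm", and the second qubit's path is recomputed after the first has moved. A noise-aware variant would replace the breadth-first search with a weighted shortest-path search over −log CNOT success rates.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the hardware nodes from src to dst."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in graph[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def apply_swap(layout, a, b):
    """Exchange whichever logical qubits sit on hardware nodes a and b."""
    for qubit, node in layout.items():
        if node == a:
            layout[qubit] = b
        elif node == b:
            layout[qubit] = a


def route_trio(graph, layout, trio):
    """Bring three logical qubits together; returns (destination, swaps)."""
    total = {q: 0 for q in trio}
    for a, b in combinations(trio, 2):
        d = len(shortest_path(graph, layout[a], layout[b])) - 1
        total[a] += d
        total[b] += d
    dest = min(trio, key=lambda q: total[q])  # smallest sum of distances
    movers = [q for q in trio if q != dest]
    swaps = []
    for q, other in (movers, movers[::-1]):
        path = shortest_path(graph, layout[q], layout[dest])
        stops = path[:-1]  # never swap onto the destination itself
        if stops[-1] == layout[other]:
            stops = stops[:-1]  # end next to the other mover instead
        for a, b in zip(stops, stops[1:]):
            apply_swap(layout, a, b)
            swaps.append((a, b))
    return dest, swaps


def choose_decomposition(graph, layout, trio):
    """Pick the 6-CNOT form on a triangle, the 8-CNOT form on a line."""
    nodes = [layout[q] for q in trio]
    linked = sum(b in graph[a] for a, b in combinations(nodes, 2))
    if linked == 3:
        return "6-CNOT"
    if linked == 2:
        return "8-CNOT"
    raise ValueError("trio is not connected after routing")


# Hypothetical 6-node line 0-1-2-3-4-5 with logical qubits a, b, c.
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
layout = {"a": 0, "b": 2, "c": 5}
dest, swaps = route_trio(line, layout, ("a", "b", "c"))
print(dest, swaps, choose_decomposition(line, layout, ("a", "b", "c")))
```

In this toy example the qubit in the middle of the line is chosen as the destination, three SWAPs bring the others next to it, and the 8-CNOT form is selected because a line has no triangle. The sketch does not track which qubit ends up in the middle of the trio, which the 8-CNOT decomposition also needs to know.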
We first evaluate the effect of our new compilation strategy by studying simple circuits containing only a single Toffoli gate. In these experiments, we place the three input qubits at random locations on the target hardware to emulate the potential locations of the qubits at some intermediate point in the execution of a more complex circuit.
We study these circuits on a real IBM device, namely IBM Johannesburg, a 20-qubit device with limited connectivity shown in Figure 5a. We use the default Qiskit compiler which decomposes the Toffoli gates before doing shortest path routing compared to our proposed method where we do shortest path routing first and then decompose the Toffoli. We study the use of two different Toffoli implementations, a 6-CNOT decomposition with full qubit connectivity and an 8-CNOT decomposition with linear qubit connectivity.
In all four configurations, we compare the total compiled CNOT counts which correlate with the total success probability of a program. For execution on Johannesburg, we prepare the qubits in the states | ⟩ , perform the compiled Toffoli, then measure the three qubits of interest and compute the success rate as the probability of obtaining the correct answer (here the | ⟩ state), where each experiment is performed with 8192 trials.
We also study Trios on real quantum benchmarks of moderate size using simulation only. The error rates of current devices are still too high to run benchmarks of these sizes, but such benchmarks are expected to become runnable as errors improve in the near future. We choose error rates 20x better than Johannesburg rates as this puts the estimated success probabilities within a reasonable range and is a realistic near-term estimate. We discuss sensitivity to this choice later.
We study four implementations of the many-controlled-NOT (CnX) gate. This subroutine has many use cases from Grover’s algorithm to various arithmetics.
The implementations take advantage of differing numbers of ancilla and are chosen based on the number of available qubits on hardware. We study three adder implementations: Cuccaro, Takahashi, and QFT. The first two have many uses of the Toffoli gate while the latter has no such gates, for comparison. We study a small version of Grover’s algorithm as well which makes use of the cnx_logancilla subroutine. Finally, we compile two common NISQ benchmarks: QAOA for Max-Cut and Bernstein-Vazirani (BV). We expect no gain on these benchmarks since they do not contain any Toffoli gates. A summary of our benchmarks is found in Table 1 using implementations found in [5].
As noted previously, the connectivity of the underlying hardware has a significant impact on the number of required SWAPs. For example, on a completely connected set of qubits, no SWAPs are ever needed. In architectures with greater connectivity, we may opt for a more efficient Toffoli decomposition using 6 CNOTs. With simulation we study the effect of connectivity on the overall expected success rates and gate counts. We study four different connectivity models, all shown in Figure 5, each with 20 qubits: the topology of IBM’s Johannesburg device containing four connected rings, a 2D mesh, a line, and a small clustered architecture representative of a QCCD ion trap.

Benchmark                        Qubits  Toffolis  CNOTs*
cnx_dirty [6]                    11      16        128
cnx_halfborrowed [14]            19      32        256
cnx_logancilla [8]               19      17        136
cnx_inplace [14]                 4       54        490
cuccaro_adder [11]               20      18        190
takahashi_adder [35]             20      18        188
incrementer_borrowedbit [14]     5       50        448
grovers [15]                     9       84        672
qft_adder [29]                   16      0         92
bv [9]                           20      0         19
qaoa_complete [13]               10      0         90

Table 1: Details about our benchmarks, both NISQ programs and other quantum subroutines. We consider circuits with and without Toffoli gates where we expect advantage only for circuits containing Toffoli gates. For BV we assume the all 1-bit string. The different CnX (many-controlled-NOT) gates use various numbers of ancilla.
*The total number of CNOT gates is counted after decomposition with the 8-CNOT Toffoli but does not include any SWAPs for routing.

Orchestrated Trios: Compiling for Efficient Communication in Quantum Programs with 3-Qubit Gates. ASPLOS ’21, April 19–23, 2021, Virtual, USA

We use error rates reported by IBM, obtained via randomized benchmarking on a daily basis. For simulations we use error numbers from Johannesburg obtained on 8/19/2020, with an average T1 time of 70 . μs, T2 time of 72 . μs, two-qubit gate time of 0 . μs, one-qubit gate time of 0 . μs, two-qubit gate error of 0.0147, and one-qubit gate error of 0.0004. Source code for all experiments is available at [12]. Experiments using IBM are tested with version 0.14.0 through their Python API. When compiling with Qiskit for the single-Toffoli experiments, we use the default settings for the transpile function while specifying the Johannesburg backend. This means light optimization is performed: a stochastic routing policy is chosen, and some simple optimizations such as single-qubit gate consolidation are performed. We fix the initial mapping to force routing to occur.

In both sets of experiments, the total number of gates required to make the input programs executable is much lower than when using the default Qiskit compiler. When compiling our simple programs consisting of a single Toffoli gate with qubits mapped to random locations, we reduce the number of gates by 35% (geometric mean).

In Figure 7 we show 35 different triplets of hardware qubits for each of the four strategies. For each triplet, we note the total distance between the qubits on the hardware, given by the shortest-path distance in the underlying topology. Even when the distance is relatively small, Trios outperforms the baseline by reducing overall gate count, and as the distance increases, the margin tends to increase. In the small-distance cases, this can be attributed to Trios choosing the better Toffoli decomposition for a linearly connected topology.
This is significant for two reasons. First, the fewer the gates, the less likely an error occurs due to qubit manipulation. Second, fewer gates, especially fewer long sequential chains of SWAPs, often mean lower circuit depth and therefore fewer chances for decoherence errors. Together this translates into faster and more successful programs.

This advantage extends to our NISQ benchmarks, which contain various numbers of Toffoli gates. In Figure 10 we note substantial reductions in total gates across all benchmarks containing Toffoli gates and across all underlying topologies. The only exception is the two smallest benchmarks (on 4 and 5 qubits) for the clustered topology, because they could be compiled with zero SWAPs.

An extreme of the clustered topology is a single cluster with all-to-all connected qubits. On this device, Orchestrated Trios would have no benefit, as operations can be performed between any pair of qubits, so no SWAPs are needed and routing is trivial. However, as quantum technologies scale to more than a few qubits, fully connected architectures hit physical limitations and must be re-engineered. As trapped-ion qubit chains get longer, for example, gate operations become slower and lower fidelity. Murali et al. [24] showed that the optimal trap size is 15-25 ions, interconnected similarly to our cluster model; with cluster sizes of 15-25, Trios does benefit.

On average, for Toffoli-containing programs we reduce gate count by 37%, 36%, 48%, and 26% for the Johannesburg, Grid, Line, and Cluster topologies respectively, with the maximum gain obtained for linear devices.
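The gate-count accounting behind these reductions can be illustrated with a toy model. The sketch below is not the Trios router; it only assumes a line topology, counts each SWAP as 3 CNOTs, and uses the 8-CNOT linear Toffoli once the three operands are adjacent. The function name and its parameters are hypothetical and chosen for illustration.

```python
def toffoli_cnot_count_on_line(positions, toffoli_cnots=8, cnots_per_swap=3):
    """Estimate CNOTs for one Toffoli on a line when the trio is routed as a unit.

    positions: three distinct integer sites on the line holding the operands.
    Assumption: qubits are not swapped back to their original sites afterwards.
    """
    left, middle, right = sorted(positions)
    if left == middle or middle == right:
        raise ValueError("the three operand qubits must occupy distinct sites")
    # Moving both outer operands next to the middle one takes
    # (middle - left - 1) + (right - middle - 1) = right - left - 2 adjacent SWAPs.
    swaps = right - left - 2
    return cnots_per_swap * swaps + toffoli_cnots


if __name__ == "__main__":
    for trio in [(3, 4, 5), (0, 4, 6), (0, 9, 19)]:
        print(trio, toffoli_cnot_count_on_line(trio))
```

In this model the routing cost depends only on the span between the two outermost operands, which is one way to see why treating the three operands as a single unit keeps communication cost low on sparsely connected devices.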
In general, we expect programs with fewer total two-qubit gates to succeed with higher probability. In devices with limited connectivity, the addition of routing operations like SWAPs, usually decomposed to 3 CNOTs, can severely reduce the chance that an input program succeeds. While success rate is inversely correlated with the number of gates, gate error is not the only reason a program can fail, and reducing gate counts does not guarantee improved success rates.

In Figure 6 we show the success rates of our Toffoli-only experiments when the two controls are initialized to |1⟩ and the target is initialized to |0⟩, so we measure the probability of obtaining |111⟩. These results were obtained from Johannesburg on 8/19/2020. The x-axes of Figures 6 and 7 line up to compare gate counts and the resulting success rates. In general, experimentally, fewer gates result in substantial improvements to success rates. For example, a Toffoli on (6-17-3) compiled with Trios improves success rate from around 30% to over 50%. On average, we improve success rates by 23% (geometric mean), with a maximum of 286%. In Figure 8, we show improvements compiled with Trios normalized to baseline for 99 different triplets of varying total distance on Johannesburg.

Trios on average improves the probability of success for these circuits. However, there are a small number of cases where Trios performs worse despite having a smaller number of total gates. This can be attributed to several different factors. For example, the edges chosen for SWAP paths may be noisier or may lie on pairs of edges with greater crosstalk, or the final measured qubits may have worse readout error. Regardless, reducing the overall gate count of a program is an important contributing factor to improving expected success rate.

For our simulated NISQ benchmarks, we see even larger gains. The reduced gate counts in Figure 10 translate to major improvements in simulated success rate in Figure 9 (normalized success rates in Figure 11).
For example, in cnx_logancilla-19, Trios more than doubles the expected success rate when compiled to each of the architectures. In many cases, the expected success rate of programs compiled with Qiskit is effectively zero while Trios has a realistic chance of obtaining the correct answer. As expected, on programs containing no Toffoli gates, Trios has no effect on success, showing that it introduces no excessive overhead. This suggests Trios can easily be added to other quantum compilation toolflows.

Trios improves gate counts, and consequently improves success rates, by routing more efficiently and choosing more appropriate Toffoli decompositions based on the underlying architecture's connectivity. Current compilers, like Qiskit, perform routing on fully decomposed and unrolled programs. While this decomposition must eventually be done, doing it first leads to less efficient routing policies and relies on the assumption that a theoretically good decomposition (fewer CNOTs) is the best decomposition for the hardware. Trios avoids this by choosing a context-dependent Toffoli decomposition and routing multiqubit gates as single units.

Trios greatly improves effectiveness compared to a heuristic-based compiler by applying similar heuristics to the higher-abstraction-level Toffoli gates. An optimal routing of the decomposed
[Figure 6 plot: "Toffoli Experiment on IBMQ Johannesburg". y-axis: success probability; x-axis: qubit triplets, followed by the geo-mean. Series: Qiskit (baseline), Qiskit (8-CNOT Toffoli), Trios (6-CNOT Toffoli), Trios (8-CNOT Toffoli).]

Figure 6: Success probabilities of Toffoli gates between random triplets of qubits. Higher is better. The x labels specify the three qubits and total swap distance. The geometric mean success rates for each compiler are 41%, 35%, 47%, and 50% respectively. Trios (8-CNOT) improves average success rate by 23% vs. the Qiskit baseline.

[Figure 7 plot: "Toffoli Experiment on IBMQ Johannesburg". y-axis: CNOT gate count; x-axis: qubit triplets, followed by the geo-mean. Series: Qiskit (baseline), Qiskit (8-CNOT Toffoli), Trios (6-CNOT Toffoli), Trios (8-CNOT Toffoli).]

Figure 7: Total number of two-qubit (CNOT) gates required to execute a Toffoli gate between various distant qubits. Lower is better. The x labels specify the three qubits and total swap distance. The geometric mean gate counts for each compiler are 29, 28, 23, and 19 respectively. Trios (8-CNOT) reduces average gate count by 35%.

[Figure 8 plot: "Toffoli Experiment Success Normalized to Baseline". y-axis: p_trios / p_baseline. Series: Trios, Qiskit (reference at 100%).]

Figure 8: Normalized success probabilities of Toffoli gates between triplets of qubits. Higher is better. Bars below 100% indicate lower success rate for Trios. The geometric mean increase in success rate is 23%. The x labels indicate the qubit distance for a range of bars.

[Figure 9 plot: "Simulated Benchmark Success Probability". y-axis: success probability; x-axis: cnx_dirty, cnx_halfborrowed, cnx_logancilla, cnx_inplace, cuccaro_adder, takahashi_adder, incrementer_borrowedbit, grovers, geometric mean, qft_adder, bv, qaoa_complete. Series: Baseline and Trios for each of ibmq, grid, line, and clusters.]

Figure 9: Simulated upper bounds on the program execution success probability on various hardware (using 20x lower idle and gate errors than Johannesburg). Neighboring pairs of bars compare the baseline with Trios compiled for Johannesburg. Higher is better when comparing pairs of bars with the same color. The geometric mean success rates over the benchmarks that use Toffoli gates for each device type respectively are 2.2% → → → →

[Figure 10 plot: "Simulated Benchmark Gate-Count Reduction over Baseline". y-axis: percent fewer CNOT gates; x-axis: the same benchmarks as Figure 9. Series: ibmq-johannesburg, full-grid-5x4, line-20, clusters-5x4.]

Figure 10: A comparison between the baseline and Trios for various hardware. Above 0% indicates benefit. All two-qubit gates (for communication and computation) are counted. The geometric mean reductions in gate counts are 37%, 36%, 48%, and 26% respectively. The rightmost three benchmarks contain zero Toffoli gates so have no change vs. the baseline.

[Figure 11 plot: "Simulated Benchmark Success Normalized to Baseline". y-axis: p_trios / p_baseline; x-axis: the same benchmarks as Figure 9. Series: ibmq-johannesburg, full-grid-5x4, line-20, clusters-5x4.]

Figure 11: Figure 9 normalized to show our consistent increase in program success with Trios. Above 1x indicates benefit. Some improvement factors are huge due to near-zero baseline success rates. The geometric mean increases in success rate are 4.4x, 3.7x, 31x, and 2.3x respectively. The rightmost three benchmarks contain zero Toffoli gates so have no change vs. the baseline.

[Figure 12 plot: "Sensitivity to Device Error Rates". x-axis: error rate improvement factor; y-axis: p_trios / p_baseline. Series: cnx_halfborrowed-19, takahashi_adder-20, cuccaro_adder-20, grovers-9, incrementer_borrowedbit-5, cnx_logancilla-19, cnx_inplace-4, cnx_dirty-11. Vertical markers: "for experiment", "for benchmarks".]
Figure 12: Factor of improvement in success rate of Trios over the baseline as gate error rates scale. The dotted line indicates current error rates on IBM Johannesburg and the dashed line (20x improvement) indicates the near-future values used in simulation. In our approximation of success rate, improvements in gate error rates lead to an exponential falloff in success ratios, as expected. In the very near term, we expect Trios to drastically improve the execution of quantum programs.

circuit would be better, except that it cannot select the best architecture location-specific decomposition. This makes a huge difference specifically with Toffolis on any square-grid-based device. One might choose to improve the solution found by an optimal compiler by always decomposing Toffolis to the 8-CNOT version before optimally routing, but this will still limit the solution. There are multiple possible qubit orders for the decomposition, and the best can only be selected after the routing pass.
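The point that the decomposition can only be chosen once the physical qubits are known can be sketched as follows. This is a minimal illustration, not the Trios implementation: it assumes routing has already placed the three operands on mutually reachable hardware qubits, and it only decides between the 6-CNOT form (all three pairs coupled) and the 8-CNOT linear form (one qubit coupled to both others). The function name, return labels, and edge-list format are hypothetical. A full pass would also decide which control or target sits in the middle.

```python
def choose_toffoli_decomposition(trio, coupling_edges):
    """Pick a Toffoli decomposition for three already-routed physical qubits.

    Returns ("6-cnot", None) when all three pairs are coupled, or
    ("8-cnot-linear", middle) where `middle` is coupled to both other qubits.
    """
    if len(set(trio)) != 3:
        raise ValueError("trio must contain three distinct qubits")
    edges = {frozenset(edge) for edge in coupling_edges}
    a, b, c = trio
    degree = {q: 0 for q in trio}
    # Count couplings that exist between the three operand qubits only.
    for u, v in ((a, b), (a, c), (b, c)):
        if frozenset((u, v)) in edges:
            degree[u] += 1
            degree[v] += 1
    coupled_pairs = sum(degree.values()) // 2
    if coupled_pairs == 3:
        return ("6-cnot", None)
    if coupled_pairs == 2:
        middle = next(q for q in trio if degree[q] == 2)
        return ("8-cnot-linear", middle)
    raise ValueError("trio is not connected; route the qubits together first")


if __name__ == "__main__":
    line = [(0, 1), (1, 2), (2, 3)]
    triangle = [(0, 1), (1, 2), (0, 2)]
    print(choose_toffoli_decomposition((2, 0, 1), line))
    print(choose_toffoli_decomposition((0, 1, 2), triangle))
```

Because the middle qubit is read off the coupling map, the same logical Toffoli can receive a different decomposition depending on where routing leaves its operands, which a decompose-first flow cannot do.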
For our simulations we use a forward-looking error model (20x better than current errors on Johannesburg). As errors improve, we expect Trios to have a reduced impact on program success rates, since gate errors will contribute less and less to program failure, though Trios will never perform worse than the baseline. In Figure 12 we study the sensitivity of simulation results to two-qubit error rates, beginning with current IBM error rates. For poor error rates, the benefit of Trios is extremely large, owing to the fact that programs compiled with the baseline have probabilities of success very close to 0. In our simplified simulation framework, as error rates improve we expect an exponential drop-off in improvement, with the most advantage obtained at current error rates.
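The exponential drop-off can be reproduced with a deliberately simplified model. The sketch below assumes that only two-qubit gate errors matter and that they are independent, so success probability is (1 - e)^n for n two-qubit gates; it ignores the idle and readout errors that a fuller simulation would include. The function names are hypothetical. The example inputs are taken from the text: geometric-mean CNOT counts of 29 (baseline) and 19 (Trios) from Figure 7, and a two-qubit gate error of 0.0147.

```python
def success_probability(num_two_qubit_gates, gate_error):
    """Probability that no two-qubit gate fails, assuming independent errors."""
    return (1.0 - gate_error) ** num_two_qubit_gates


def trios_improvement(n_baseline, n_trios, current_error, improvement_factor):
    """Ratio p_trios / p_baseline after dividing the gate error by a factor."""
    scaled_error = current_error / improvement_factor
    return (success_probability(n_trios, scaled_error)
            / success_probability(n_baseline, scaled_error))


if __name__ == "__main__":
    for factor in (1, 5, 20, 100):
        ratio = trios_improvement(29, 19, 0.0147, factor)
        print(f"{factor:>4}x better errors -> success ratio {ratio:.3f}")
```

In this model the ratio equals (1 - e)^(n_trios - n_baseline), so it shrinks toward 1 as the error rate e falls, matching the qualitative trend described for Figure 12.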
We present a new quantum compilation structure, Trios, with a split decomposition pass to greatly reduce compiled communication cost and enable architecture-tuned decompositions. We specifically target the three-qubit Toffoli operation to capture program structure, enabling more efficient compiled circuits. Because current quantum computers are especially error prone, they require high levels of optimization to reduce gate counts and maximize the probability the compiled program will succeed.

Orchestrated Trios both greatly improves the effectiveness of qubit routing, given newly exposed program structure, and improves decompositions through connectivity awareness. Both greatly benefit the program success rate, a critical metric for today's error-prone and resource-constrained quantum computers. We hope this inspires more hierarchically designed NISQ algorithms, now that we have shown that breaking the abstractions of discrete compilation passes can help bridge the gap between noisy quantum hardware and practical applications.
ACKNOWLEDGMENTS
This work is funded in part by EPiQC, an NSF Expedition in Computing, under grant CCF-1730449; in part by STAQ under grant NSF Phy-1818914; in part by DOE grants DE-SC0020289 and DE-SC0020331; and in part by NSF OMA-2016136 and the Q-NEXT DOE NQI Center.

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

Disclosure: F. Chong is also Chief Scientist at Super.tech and an advisor to Quantum Circuits, Inc.
REFERENCES
Algorithms for the Optimization of Quantum Circuits. Master's thesis. University of Waterloo. http://hdl.handle.net/10012/7818
[5] Jonathan M. Baker, Casey Duckering, Pranav Gokhale, and Andrew Litteken. 2020. Quantum Circuit Benchmarks. https://github.com/jmbaker94/quantumcircuitbenchmarks.
[6] Jonathan M. Baker, Casey Duckering, Alexander Hoover, and Frederic T. Chong. 2019. Decomposing Quantum Generalized Toffoli with an Arbitrary Number of Ancilla. arXiv preprint (April 2019). arXiv:1904.01671
[7] Jonathan M. Baker, Casey Duckering, Alexander Hoover, and Frederic T. Chong. 2020. Time-Sliced Quantum Circuit Partitioning for Modular Architectures. In Proceedings of the 17th ACM International Conference on Computing Frontiers. 98–107. https://doi.org/10.1145/3387902.3392617 arXiv:2005.12259
[8] Adriano Barenco, Charles H. Bennett, Richard Cleve, David P. DiVincenzo, Norman Margolus, Peter Shor, Tycho Sleator, John A. Smolin, and Harald Weinfurter. 1995. Elementary gates for quantum computation. Physical Review A 52, 5 (Nov. 1995), 3457–3467. https://doi.org/10.1103/PhysRevA.52.3457 arXiv:quant-ph/9503016
[9] Ethan Bernstein and Umesh Vazirani. 1993. Quantum Complexity Theory. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing (San Diego, California, USA) (STOC ’93). 11–20. https://doi.org/10.1145/167088.167097
[10] Alexander Cowtan, Silas Dilkes, Ross Duncan, Alexandre Krajenbrink, Will Simmons, and Seyon Sivarajah. 2019. On the Qubit Routing Problem. 135 (Feb. 2019), 5:1–5:32. https://doi.org/10.4230/LIPIcs.TQC.2019.5 arXiv:1902.08091
[11] Steven A. Cuccaro, Thomas G. Draper, Samuel A. Kutin, and David Petrie Moulton. 2004. A new quantum ripple-carry addition circuit. arXiv preprint (Oct. 2004). arXiv:quant-ph/0410184
[12] Casey Duckering, Jonathan M. Baker, and Andrew Litteken. 2021. Source Code for Orchestrated Trios. https://github.com/cduck/orchestrated-trios.
[13] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. 2014. A Quantum Approximate Optimization Algorithm. arXiv preprint (Nov. 2014). arXiv:1411.4028
[14] Craig Gidney. [n.d.]. Constructing Large Controlled Nots. https://algassert.com/circuits/2015/06/05/Constructing-Large-Controlled-Nots.html
[15] Lov K. Grover. 1996. A fast quantum mechanical algorithm for database search. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing. 212–219. https://doi.org/10.1145/237814.237866 arXiv:quant-ph/9605043
[16] Gian Giacomo Guerreschi and Jongsoo Park. 2018. Two-step approach to scheduling quantum circuits. Quantum Science and Technology 3, 4 (July 2018), 045003. https://doi.org/10.1088/2058-9565/aacf0b arXiv:1708.00023
[17] Yuichi Hirata, Masaki Nakanishi, Shigeru Yamashita, and Yasuhiko Nakashima. 2011. An efficient conversion of quantum circuits to a linear nearest neighbor architecture. Quantum Information and Computation 11, 1 (Jan. 2011), 142–166.
[18] ibm0 [n.d.]. IBM Quantum Devices. https://quantumexperience.ng.bluemix.net/qx/devices. Accessed: 2018-05-16.
[19] Ali JavadiAbhari, Shruti Patil, Daniel Kudrow, Jeff Heckey, Alexey Lvov, Frederic T. Chong, and Margaret Martonosi. 2015. ScaffCC: Scalable compilation and analysis of quantum programs. Parallel Comput. 45 (2015), 2–17. https://doi.org/10.1016/j.parco.2014.12.001 arXiv:1507.01902
[20] Philip Krantz, Morten Kjaergaard, Fei Yan, Terry P. Orlando, Simon Gustavsson, and William D. Oliver. 2019. A quantum engineer's guide to superconducting qubits. Applied Physics Reviews 6, 2 (June 2019), 021318. https://doi.org/10.1063/1.5089550 arXiv:1904.06560
[21] Dmitri Maslov, Gerhard W. Dueck, D. Michael Miller, and Camille Negrevergne. 2008. Quantum Circuit Simplification and Level Compaction. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27, 3 (March 2008), 436–444. https://doi.org/10.1109/TCAD.2007.911334 arXiv:quant-ph/0604001
[22] Steven Moses, Juan Pino, Joan Dreiling, Caroline Figgatt, John Gaebler, Michael Allman, Charles Baldwin, Michael Foss-Feig, David Hayes, Karl Mayer, Ciaran Ryan-Anderson, and Brian Neyenhuis. 2020. Demonstration of the QCCD trapped-ion quantum computer architecture. Bulletin of the American Physical Society (June 2020). arXiv:2003.01293
[23] Prakash Murali, Jonathan M. Baker, Ali Javadi-Abhari, Frederic T. Chong, and Margaret Martonosi. 2019. Noise-Adaptive Compiler Mappings for Noisy Intermediate-Scale Quantum Computers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. 1015–1029. https://doi.org/10.1145/3297858.3304075 arXiv:1901.11054
[24] Prakash Murali, Dripto M. Debroy, Kenneth R. Brown, and Margaret Martonosi. 2020. Architecting Noisy Intermediate-Scale Trapped Ion Quantum Computers. In . 529–542. https://doi.org/10.1109/ISCA45697.2020.00051 arXiv:2004.04706
[25] Prakash Murali, David C. McKay, Margaret Martonosi, and Ali Javadi-Abhari. 2020. Software Mitigation of Crosstalk on Noisy Intermediate-Scale Quantum Computers. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 1001–1016. https://doi.org/10.1145/3373376.3378477 arXiv:2001.02826
[26] Yunseong Nam, Neil J. Ross, Yuan Su, Andrew M. Childs, and Dmitri Maslov. 2018. Automated optimization of large quantum circuits with continuous parameters. npj Quantum Information 4, 1 (March 2018), 1–12. https://doi.org/10.1038/s41534-018-0072-4 arXiv:1710.07345
[27] Michael A. Nielsen and Isaac L. Chuang. 2011. Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press, New York, NY, USA.
[28] John Preskill. 2018. Quantum Computing in the NISQ era and beyond. Quantum
[29] Lidia Ruiz-Perez and Juan Carlos Garcia-Escartin. 2017. Quantum arithmetic with the quantum Fourier transform. Quantum Information Processing 16, 6 (April 2017), 152:1–152:14. https://doi.org/10.1007/s11128-017-1603-1 arXiv:1411.5949
[30] Zahra Sasanian and D. Michael Miller. 2012. Reversible and Quantum Circuit Optimization: A Functional Approach. In International Workshop on Reversible Computation. Springer Berlin Heidelberg, 112–124. https://doi.org/10.1007/978-3-642-36315-3_9
[31] Norbert Schuch. 2002. Implementation of quantum algorithms with Josephson charge qubits. Universität Regensburg (Dec. 2002). https://epub.uni-regensburg.de/1511/
[32] Peter W. Shor. 1997. Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. SIAM J. Comput. 26, 5 (1997), 1484–1509. https://doi.org/10.1137/S0097539795293172 arXiv:quant-ph/9508027
[33] Marcos Yukio Siraichi, Vinícius Fernandes dos Santos, Sylvain Collange, and Fernando Magno Quintão Pereira. 2018. Qubit allocation. In Proceedings of the 2018 International Symposium on Code Generation and Optimization. 113–125. https://doi.org/10.1145/3168822
[34] Robert S. Smith, Michael J. Curtis, and William J. Zeng. 2016. A Practical Quantum Instruction Set Architecture. arXiv preprint (2016). arXiv:1608.03355
[35] Yasuhiro Takahashi, Seiichiro Tani, and Noboru Kunihiro. 2009. Quantum Addition Circuits and Unbounded Fan-Out. arXiv preprint (Oct. 2009). arXiv:0910.2530
[36] Bochen Tan and Jason Cong. 2020. Optimal Layout Synthesis for Quantum Computing. In . IEEE. arXiv:2007.15671
[37] Swamit S. Tannu and Moinuddin Qureshi. 2019. Ensemble of Diverse Mappings: Improving Reliability of Quantum Computers by Orchestrating Dissimilar Mistakes. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 253–265. https://doi.org/10.1145/3352460.3358257
[38] Robert Wille, Lukas Burgholzer, and Alwin Zulehner. 2019. Mapping Quantum Circuits to IBM QX Architectures Using the Minimal Number of SWAP and H Operations. In . IEEE. arXiv:1907.02026
[39] Robert Wille, Oliver Keszocze, Marcel Walter, Patrick Rohrs, Anupam Chattopadhyay, and Rolf Drechsler. 2016. Look-ahead Schemes for Nearest Neighbor Optimization of 1D and 2D Quantum Circuits. In . IEEE, 292–297. https://doi.org/10.1109/ASPDAC.2016.7428026
[40] Robert Wille, Aaron Lye, and Rolf Drechsler. 2014. Optimal SWAP gate insertion for nearest neighbor quantum circuits. In . IEEE, 489–494. https://doi.org/10.1109/ASPDAC.2014.6742939