SlackQ: Approaching the Qubit Mapping Problem with a Slack-aware Swap Insertion Scheme

Abstract

The rapid progress in the physical implementation of quantum computers has paved the way for tools that help users write quantum programs for any given quantum device. The physical constraints inherent in current NISQ architectures prevent most quantum algorithms from being directly executed on quantum devices. To enable two-qubit gates in the algorithm, existing works focus on inserting SWAP gates to dynamically remap logical qubits to physical qubits. However, their schemes lack consideration of the execution time of the generated quantum circuits. In this work, we propose a slack-aware SWAP insertion scheme for the qubit mapping problem in the NISQ era. Our experiments show performance improvement by up to 2.36X, and by 1.62X on average, over 106 representative benchmarks from RevLib, IBM Qiskit, and ScaffCC.


Chi Zhang ∗‡, Yanhao Chen †, Yuwei Jin †, Wonsun Ahn ∗, Youtao Zhang ∗, Eddy Z. Zhang †§

∗ University of Pittsburgh † Rutgers University ‡ [email protected] § [email protected]

I. INTRODUCTION

Quantum computing has been considered a potentially disruptive computation model. In 2019, Google [4] demonstrated “Quantum Supremacy” with its 54-qubit quantum processor, which performed in 200 seconds a computational task that would have taken a state-of-the-art classical supercomputer 10,000 years. In general, quantum computing has a significant advantage over classical computing for applications including large number factoring [5], database search [6], and quantum simulation [7].

Labs in academia and industry are now able to build quantum computers with up to 49-72 qubits. IBM [8] released its 53-qubit quantum computer in October 2019 and has made it available for commercial use. Google [9] released the 72-qubit Bristlecone quantum computer in March 2018. Intel [10] and Rigetti [11] have each released quantum computing devices with dozens of qubits. Further, a few small-scale quantum computers with fewer than 20 qubits are freely available to the public, for example, the series of quantum computers provided by the IBM Q Experience [12].

The physical constraints inherent in quantum architectures prevent quantum algorithms from being directly executed on the device. One of the major constraints that must be accounted for before quantum algorithms can be executed is the qubit connectivity constraint. In superconducting quantum computers (the implementation adopted by major industry players such as IBM and Google), qubits are not fully connected. They follow a nearest neighbor (NN) interaction model, enforced by the connectivity of the physical qubit array. If an algorithm requires communication between qubits that are not physically connected, the algorithm cannot be directly executed on the device.

To solve the qubit connectivity problem, any two logical qubits that need to communicate according to the algorithm must be mapped to physical qubits on the device that are neighboring (connected). This is done through one SWAP operation or a sequence of SWAP operations. A SWAP operation exchanges the states of two neighboring qubits, in effect “moving” the two qubits. An example is shown in Fig. 1. This dynamic remapping between logical and physical qubits may need to happen multiple times throughout the algorithm.

Inserting swap operations inevitably results in increased gate count and execution time. Previous studies [13], [14], [15], [16], [2], [17] focus on optimizing gate count but not execution time. Execution time is an important measure of the performance of a circuit. Minimal gate count does not necessarily guarantee minimal time. We show an example in Fig. 1 where two qubit mapping solutions yield the same gate count but only one of them is optimal in time.

Optimizing the execution time of a circuit is important not only for performance but also for improving the fidelity of a quantum circuit. Quantum computers are not perfect. Qubits are fickle and error prone. As time goes by, a qubit decoheres and error accumulates. The time a qubit can survive without losing its state information with high probability is called the coherence time. The longer a circuit has to execute, the more likely it will approach a qubit’s coherence time. IBM proposes the metric of quantum volume [18] for evaluating the effectiveness of quantum computers. One important factor for calculating quantum volume is the maximum depth of a circuit that can be executed by a quantum computer before accumulating a certain amount of error. Here the depth represents the amount of time a circuit executes. Optimizing the depth of a circuit is important as only circuits that fit into the quantum volume can run successfully and generate meaningful computational results. Thus quantum compilers must take the execution time of the generated circuit into consideration.

Fig. 1: (a) Physical qubit connectivity (IBM QX2); (b) the original logical circuit (logical qubits q1, q2, q3, q4, q5 are mapped to physical qubits Q1, Q2, Q3, Q4, Q5). The gate marked in red is the CNOT gate that cannot be executed due to a connectivity constraint (Q2 and Q5 are not physically connected); (c) uses 1 swap, and the execution time of the circuit is not increased; (d) uses 1 swap, and the execution time of the circuit is increased. The operations marked in blue are swap operations. We assume a swap operation can be decomposed into three CNOT gates and each gate takes 1 cycle in this example.

In this paper, we focus on time-aware qubit mapping. A good time-aware qubit mapper needs to yield a hardware-compliant circuit while having optimal or near-optimal execution time. We discover that the key is to find intervals with slack in the circuit and to use the slack to hide the latency of inserted swap operations. We present important considerations for detecting and exploiting slack in the circuit. Our qubit mapper, named SlackQ, automatically searches for dynamic qubit mappings given an input program on a quantum architecture with arbitrary qubit connectivity. The experiments show that SlackQ improves performance by up to 2.36X, and by 1.62X on average, over 106 representative benchmarks from RevLib [1], IBM Qiskit [2], and ScaffCC [3].

II. BACKGROUND AND MOTIVATION

A. Quantum Computing Basics

1) Qubit:

A quantum bit, or qubit, is the counterpart to a classical bit in the realm of quantum computing. Unlike a classical bit, which represents either ‘1’ or ‘0’, a qubit is in a coherent superposition of both states. The state $|\psi\rangle$ associated with a qubit is a unit vector in a two-dimensional vector space. The state of a qubit can be represented as $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$, where $\alpha$ and $\beta$ are two complex numbers such that $|\alpha|^2 + |\beta|^2 = 1$. $\alpha$ and $\beta$ are called amplitudes. Upon the standard measurement, the state $|\psi\rangle$ will collapse into the basis state $|0\rangle$ with probability $|\alpha|^2$ or the basis state $|1\rangle$ with probability $|\beta|^2$. A system of $n$ qubits encodes a superposition of $2^n$ basis vectors with $2^n$ amplitudes. A classical $n$-bit system encodes the information of one basis vector in the vector space, but an $n$-qubit system encodes the information of a $2^n$-dimensional vector space. Operating on one $n$-qubit state is as if operating on $2^n$ complex numbers at one time. This is one of the reasons for the potential exponential speedup using quantum computing.
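To make the amplitude picture concrete, the following minimal Python sketch normalizes a pair of amplitudes, derives the measurement probabilities, and counts the amplitudes of an n-qubit state. The function names are illustrative assumptions and are not taken from the paper or from any quantum library.

```python
import math

def normalize(alpha: complex, beta: complex) -> tuple[complex, complex]:
    """Scale (alpha, beta) so that |alpha|^2 + |beta|^2 = 1."""
    norm = math.sqrt(abs(alpha) ** 2 + abs(beta) ** 2)
    return alpha / norm, beta / norm

def measurement_probabilities(alpha: complex, beta: complex) -> tuple[float, float]:
    """Probabilities of collapsing to |0> and |1> under the standard measurement."""
    return abs(alpha) ** 2, abs(beta) ** 2

def num_amplitudes(n_qubits: int) -> int:
    """An n-qubit state is described by 2**n complex amplitudes."""
    return 2 ** n_qubits

# Equal superposition with a relative phase of i between |0> and |1>.
alpha, beta = normalize(1 + 0j, 1j)
p0, p1 = measurement_probabilities(alpha, beta)
```

Here both outcomes are equally likely, and a 5-qubit device such as the one in Fig. 1 already requires `num_amplitudes(5)` = 32 amplitudes to describe.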

2) Quantum Gates:

There are two types of elementary quantum gates. One is the single-qubit gate, which is a unitary quantum operation that can be abstracted as a rotation around an axis of the Bloch sphere [19]. A single-qubit gate can also be represented using a 2 by 2 unitary matrix. Important single-qubit gates include the H (Hadamard) gate and the S (phase shift by $\pi/2$) gate [20].

The second type of gate is the multi-qubit gate. The controlled-NOT (CNOT) is a two-qubit gate that arguably plays the most important role in quantum computation. The two qubits involved in a CNOT gate are the control qubit and the target qubit. If the control qubit is 0, the gate leaves the target qubit unchanged. If it is 1, the gate applies a NOT gate to the target qubit. The CNOT gate entangles qubits and allows qubits to communicate. The CNOT gate, H gate, S gate, and T gate together form a universal set called the Clifford+T library. Any quantum algorithm can be implemented using a composition of gates from the universal set.
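As a small illustration of how CNOT acts on amplitudes and how it entangles qubits, the sketch below applies a Hadamard gate to the control qubit and then a CNOT to the two-qubit state |00>. The list-based state representation and the helper names are assumptions made for this example only.

```python
import math

def apply_h_on_control(state):
    """Hadamard on the control qubit; state = [a00, a01, a10, a11], indexed |control target>."""
    a00, a01, a10, a11 = state
    s = 1 / math.sqrt(2)
    return [s * (a00 + a10), s * (a01 + a11), s * (a00 - a10), s * (a01 - a11)]

def apply_cnot(state):
    """CNOT flips the target when the control is 1, i.e. it exchanges a10 and a11."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]

# Start in |00>, put the control in superposition, then entangle with CNOT.
bell = apply_cnot(apply_h_on_control([1, 0, 0, 0]))
```

The result has equal amplitudes on |00> and |11> and none on |01> or |10>, which is an entangled (Bell) state.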

3) Quantum Circuit:

A quantum algorithm can be expressed as a quantum circuit, which is composed of a set of qubits and a sequence of quantum operations on these qubits. A quantum circuit can be thought of as a quantum algorithm in “assembly language”. There are two different ways to describe quantum circuits. One way is to use the quantum assembly language called OpenQASM [21] released by IBM. The other way is to use a circuit diagram, in which qubits are represented as horizontal lines. Input is on the left and output is on the right. Unlike a classical circuit, a quantum circuit must have the same number of input and output qubits. Fig. 1 (b) shows an example quantum circuit diagram. Logical qubits are denoted using lowercase letters (q1, q2, ...) and physical qubits are denoted using uppercase letters (Q1, Q2, ...). Initially, logical qubits q1, q2, q3, q4, and q5 are mapped to physical qubits Q1, Q2, Q3, Q4, and Q5. A single-qubit gate is denoted as a square on the line. A CNOT gate is represented as a line connecting two qubits, where the control qubit is marked with a dot and the target qubit with a ⊕ sign. In this paper, we use the circuit diagram representation to describe examples.

B. Qubit Mapping Problem

To enable the execution of a quantum circuit, the logical qubits in the circuit must be mapped to the physical qubits on the target hardware. When applying a CNOT gate, the two logical qubits involved in the CNOT gate must be mapped to two physical qubits connected to each other. Due to the irregular layout and connectivity of the qubits in the target device, it is sometimes impossible to find an initial mapping that makes the entire circuit CNOT-compliant. The common practice is to insert SWAP operations to remap the logical qubits whenever a CNOT gate cannot be applied.

A SWAP operation exchanges the states of the two input qubits of interest. As shown in Fig. 2 (b), a swap operation is implemented using 3 CNOT gates for architectures with bi-directional links, where a bi-directional link means either end of the link can be the control or the target qubit. Alternatively, it can be implemented using 3 CNOT gates plus 4 Hadamard gates for architectures with single-direction links, as shown in Fig. 2 (c), where a single-direction link means only one end of the link can be the control qubit.

The qubit mapping problem takes a logical circuit and a hardware coupling graph as input and outputs a transformed circuit that fits on the hardware device by inserting SWAP operations. After the transformation, all CNOT gates must be performed on qubits that are connected in the physical architecture. Due to the swaps, a logical qubit may be mapped to different physical qubits at different points in the circuit execution. However, at any given point, a logical qubit is mapped to exactly one physical qubit, since we only use swaps to move qubits and do not make any copies.

Fig. 2: Implementation of a SWAP operation: (a) the SWAP notation, where m and n are two logical qubits that exchange their states after the SWAP; (b) for bidirectional links, where the three CNOTs that implement the SWAP do not need to use the same control qubit; and (c) for single-direction links, where the three CNOTs must use the same control qubit.

An example of circuit transformation is shown in Fig. 1, where (a) is the physical connectivity, (b) is the original circuit, and (c) and (d) are two different hardware-compliant circuits generated from the same original circuit.
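The three-CNOT implementation of SWAP on a bidirectional link can be checked on computational basis states with a few lines of Python; by linearity the same identity holds for superpositions. The sketch also shows the bookkeeping a mapper performs after a swap: the two logical qubits exchange their physical locations. The helper names and the dictionary-based mapping are assumptions for illustration.

```python
def cnot(bits, control, target):
    """Classical action of CNOT on a basis state: target ^= control."""
    out = list(bits)
    out[target] ^= out[control]
    return out

def swap_via_cnots(bits, m, n):
    """SWAP(m, n) built from three CNOTs with alternating control qubits."""
    bits = cnot(bits, m, n)
    bits = cnot(bits, n, m)
    bits = cnot(bits, m, n)
    return bits

def apply_swap_to_mapping(mapping, qi, qj):
    """Exchange the physical qubits assigned to logical qubits qi and qj."""
    new_mapping = dict(mapping)
    new_mapping[qi], new_mapping[qj] = mapping[qj], mapping[qi]
    return new_mapping

# Exhaustive check over the four two-qubit basis states.
all_swapped = all(
    swap_via_cnots([a, b], 0, 1) == [b, a] for a in (0, 1) for b in (0, 1)
)
mapping = apply_swap_to_mapping({"q3": "Q3", "q5": "Q5"}, "q3", "q5")
```

Because each CNOT is assumed to take one cycle, this decomposition is also why a swap is counted as three cycles in the examples of this paper.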

C. Parallelism in Quantum Circuit

As in classical computers, parallelism is important in quantum computers. Parallelism comes from independent operations on different qubits. Gates on the same qubit have to run sequentially. For instance, if a and b are two consecutive gates on the same qubit, and a is before b in the program, then gate b depends on a. Gates that do not share any qubit are independent. A two-qubit gate depends on up to two gates since it involves two qubits. A two-qubit gate has up to two immediate successors in the dependence graph. A dependence graph can be built with respect to the partial order between gates. It is a directed acyclic graph (DAG).

In a transformed circuit, the parallelism could be (1) between the gates in the original circuit (as g2 and g3 in Fig. 1 (b)), (2) between the SWAP gates that are inserted into the original circuit, and (3) between a gate in the original circuit and a newly inserted gate (as swap 3,5 and g1 in Fig. 1 (c)). A good qubit mapping algorithm should consider all types of parallelism. However, existing studies only consider the first two types of parallelism. Our work is the first one that systematically exploits all types of parallelism.

As shown in Fig. 1, the two best known approaches, by Zulehner et al. [16] and Li et al. [13], do not distinguish solution (c) from solution (d), as both solutions insert 1 SWAP. Both [16] and [13] only optimize the number of gates inserted into the circuit (or the parallelism of the inserted gates), but not the parallelism of the transformed circuit. The solution in Fig. 1 (c) is better than Fig. 1 (d) as the inserted swap can run in parallel with the gates in the original circuit. This example stresses the importance of time-awareness in SWAP insertion schemes and motivates our work.

III. INSIGHT AND DESIGN

To improve the parallelism between the inserted SWAP operations and the gates in the original circuit, we discover that it is important to exploit the slack intervals in the circuit. The slack represents the idle time in the original circuit for a given set of qubits. The key is to hide the latency of inserted swap operations by using the qubits that are idle at that time of the circuit execution. This forms the main idea of this paper: we insert SWAP operations such that they leverage slack in the circuit as much as possible.
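The idea of idle time on a qubit can be sketched with an as-soon-as-possible (ASAP) schedule: each gate starts when all of its qubits are free, and any gap between two consecutive gates on the same qubit is recorded as slack. The circuit below is a hypothetical toy (four single-qubit gates on q2, one on q3, then a CNOT on both), not a circuit from the paper, and the function name is an assumption.

```python
def asap_schedule(gates):
    """gates: list of (name, qubits, cycles) in program order.

    Returns (schedule, slack) where schedule maps a gate name to (start, end)
    and slack lists (qubit, idle_from, idle_until) gaps between consecutive gates.
    """
    ready = {}      # qubit -> cycle at which the qubit becomes free
    schedule = {}
    slack = []
    for name, qubits, cycles in gates:
        start = max(ready.get(q, 0) for q in qubits)
        for q in qubits:
            # Only gaps after an earlier gate on the same qubit count as slack.
            if q in ready and start > ready[q]:
                slack.append((q, ready[q], start))
            ready[q] = start + cycles
        schedule[name] = (start, start + cycles)
    return schedule, slack

toy_circuit = [
    ("h1", ("q2",), 1), ("s1", ("q2",), 1), ("h2", ("q2",), 1), ("g2", ("q2",), 1),
    ("g1", ("q3",), 1),
    ("g3", ("q2", "q3"), 1),   # CNOT must wait for the slower qubit q2
]
schedule, slack = asap_schedule(toy_circuit)
```

In this toy circuit q3 is idle for three cycles before the CNOT, which is exactly long enough to hide one three-cycle swap on q3. The sketch does not distinguish how much of a gap could be shifted by rescheduling gates.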

A. Slack

We define slack as the idle time between two consecutive gates on the same qubit(s) that can be used to perform a SWAP operation without affecting the total execution time of the entire circuit. The slack time is usually caused by dependence between gates and/or variation of gate count on individual qubits.

The slack time due to dependence between gates only occurs when there are two-qubit gates in the circuit. Recall that a CNOT gate depends on up to two other gates, since CNOT is a two-qubit gate. If the qubits are running at different speeds, one of the two qubits might be ready earlier than the other. The faster qubit thus needs to wait for the slower qubit to finish before the CNOT gate can be executed. On the other hand, if a circuit has a number of qubits, and the number of gates on each qubit is different (even if they are all independent), then some qubits will inevitably be idle at some point of the execution. The slack intervals can be used for inserting swap operations that resolve qubit mapping constraints. An example of slack in the circuit is shown in Fig. 3.

Fig. 3: Slack in the circuit: g3 depends on g1 and g2, and g1 finishes earlier than g2; therefore, for qubit q3, there is a slack interval of three cycles (assuming each gate takes one cycle) between g1 and g3 in the circuit.

There are two types of slack in the circuit. One type does not require the rescheduling of gates, and we define it as fixed slack. The other type may have a variable number of cycles, and we denote it as flexible slack. A good qubit mapper needs to search globally and exploit both fixed and flexible slack.

a) Fixed Slack: An example of fixed slack is shown in Fig. 4 (b). Assuming each gate takes one cycle, there is a fixed slack between g1 and g2 on qubit q2. Here g2 cannot be delayed and g1 cannot start earlier if the total execution time is to remain unchanged. If qubit q2 is used to perform another operation, such as a swap, during the three cycles, it will not affect the execution time of the entire circuit. In this case, the number of cycles that can be used on q2 between g1 and g2 is fixed.

b) Flexible Slack: Fixed slack does not always exist. It is sometimes necessary to move gates in order to create slack for latency hiding. We show an example in Fig. 4 (c) and (d), where slack can be created by moving g1. Suppose the first two CNOT gates are scheduled on cycle 1 and the three single-qubit gates are scheduled on cycles 2, 3, and 4, respectively. Then g3 can be executed on cycle 5 at the earliest. g3 depends on g1. g1 can be scheduled at the second cycle, the third cycle (Fig. 4 (c)), or the fourth cycle (Fig. 4 (d)) without delaying g3. Consequently, a slack of zero, one, or two cycles can be created before g1, depending on when g1 is scheduled. This type of slack is flexible. On the other hand, since g1 is not directly executable due to the connectivity constraint in Fig. 4 (a), the more slack there is before g1, the better it is for hiding the swap latency. In Fig. 6, we show that by moving g1 forward, q2 and q3 have more slack before g1, and swap(3,4) is inserted to utilize the slack, resulting in a total circuit time of only 6 cycles, which is optimal in this case.

It is worth mentioning that flexible slack can cascade, as the rescheduling of one gate might affect its descendants or predecessors. For fixed slack, the gates involved cannot be delayed without affecting the circuit time. Flexible slack allows one or multiple gates to delay their start within reasonable time window(s). Flexible slack is more complicated than fixed slack. It is necessary to analyze and exploit flexible slack in a systematic way.

B. Dynamic Gate Scheduler

We model the resolution of qubit mapping conflicts as a dynamic scheduling process. Gates in the circuit are scheduled as soon as their dependencies are resolved. When a gate cannot be scheduled due to a connectivity problem, we insert a swap (or a combination of swaps) to change the qubit mapping so that the gate can be executed on the physical device. All the gates that have already been scheduled at one point of scheduling are called the Processed Circuit, and the gates that still await scheduling are called the Remaining Circuit.

Fig. 6 shows an example of how the scheduling works. With the initial mapping {q1, q2, q3, q4} → {Q1, Q2, Q3, Q4}, the first two CNOTs and the three single-qubit gates can be scheduled without remapping. At this point, the gates that are scheduled are part of the Processed Circuit. The remaining two CNOT gates (g1 and g3), which cannot be scheduled, are part of the Remaining Circuit. Gate g1 cannot be scheduled because Q2 and Q3 are not connected in the device. Gate g3 cannot be scheduled because g1 must be scheduled before g3 (write-after-read dependency on Q2). The dashed lines divide the circuit into the processed part and the remaining part.

To minimize the circuit time, we search for swap candidates for g1 that maximally hide swap latencies using circuit slack. Fig. 5 shows the key idea behind the search for optimal swap candidates. The search reveals multiple hardware-compliant candidates that use different sequences of swaps to achieve compliance. We choose the optimal candidate by calculating the Slack Utilization of each candidate and choosing the one with the best utilization. In Fig. 6, we choose swap candidate swap(q3, q4) since it best hides the swap latency behind the 2-cycle slack shown in Fig. 4 (d). Now g1 is satisfied and scheduling can proceed.

C. Critical Gates

The gates in the remaining circuit whose dependences have been resolved but whose connectivity problems have not can be divided into two groups: those on the critical path and those that are not. We denote the gates on the critical path as critical gates, and the others as non-critical gates. In parallel computing, the critical path length is equal to the execution time when there is enough parallelism. In our case, the critical path length is equal to the execution time, as the maximum parallelism (the maximum number of gates that can run concurrently) is at most the number of qubits. Thus it is important to prioritize the scheduling of critical gates over non-critical gates.

To prioritize critical gates, we need to resolve the connectivity problems of critical gates as early as possible. Imagine a scenario where two gates have connectivity problems, one critical and the other non-critical, and their connectivity issues cannot be resolved at the same time. In this situation, we should resolve the critical gate first, as resolving the non-critical gate can likely be delayed without affecting the overall execution time.
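One common way to identify gates on the critical path is to compare each gate's earliest possible start (ASAP) with its latest start that still preserves the circuit time (ALAP): gates with no scheduling freedom are critical. The sketch below applies this to a hypothetical toy circuit. It is an illustration under that assumption, not the criticality test used inside SlackQ, and it ignores connectivity.

```python
def critical_gates(gates):
    """gates: list of (name, qubits, cycles) in program order.

    Returns (makespan, names of gates whose earliest and latest start coincide).
    """
    # Forward pass: earliest start of every gate.
    ready, earliest = {}, {}
    for name, qubits, cycles in gates:
        earliest[name] = max(ready.get(q, 0) for q in qubits)
        for q in qubits:
            ready[q] = earliest[name] + cycles
    makespan = max(ready.values())

    # Backward pass: latest start that does not lengthen the circuit.
    due, critical = {}, []
    for name, qubits, cycles in reversed(gates):
        latest_start = min(due.get(q, makespan) for q in qubits) - cycles
        for q in qubits:
            due[q] = latest_start
        if latest_start == earliest[name]:
            critical.append(name)
    return makespan, critical[::-1]

toy_circuit = [
    ("a1", ("x",), 1), ("a2", ("x",), 1), ("a3", ("x",), 1),
    ("b1", ("y",), 1),
    ("cx", ("x", "y"), 1),
]
makespan, critical = critical_gates(toy_circuit)
```

Gate `b1` can start anywhere in cycles 0 to 2 without delaying `cx`, so it is non-critical, while the chain on qubit `x` and the final CNOT determine the circuit time.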

Fig. 4: (a) Qubit coupling graph; (b) an example of fixed slack in the quantum circuit; (c) and (d) examples of flexible slack, where g1 can be moved within a time window without affecting the circuit execution time; the slack before g1 can be either 1 cycle or 2 cycles, assuming every gate takes one cycle. Note that the slack between g1 and g3 may also vary due to the scheduling of g1.

[Fig. 5 diagram: between the Processed Circuit (PC) and the Remaining Circuit (RC), a set S of candidate SWAP sequences S1, S2, S3, ..., Sn is considered. For each hardware-compliant candidate S_X, the slack utilization SU(X) = calc_slack_util(PC, S_X, RC) is calculated, and the S_X with the greatest SU(X) is selected.]

Fig. 5: Choose SWAP Candidates
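The selection step in Fig. 5 can be sketched as scoring each candidate and keeping the best one. Since the exact slack-utilization formula is not given at this point, the sketch uses a stand-in score: the increase in ASAP circuit time caused by inserting the candidate's swaps, where a smaller increase means the swaps are better hidden in slack. All names, the candidate format, and the toy circuit on a line of physical qubits Q0-Q1-Q2 are assumptions.

```python
def circuit_time(gates):
    """Makespan of an ASAP schedule; gates are (physical_qubits, cycles)."""
    ready = {}
    for qubits, cycles in gates:
        start = max(ready.get(q, 0) for q in qubits)
        for q in qubits:
            ready[q] = start + cycles
    return max(ready.values(), default=0)

def time_increment(processed, candidate, ideal_time):
    """Extra cycles caused by a candidate's swaps and its remapped remaining gates."""
    gates = processed + candidate["swaps"] + candidate["remaining"]
    return circuit_time(gates) - ideal_time

def choose_candidate(processed, candidates, ideal_time):
    """Pick the candidate whose swaps lengthen the circuit the least."""
    return min(candidates, key=lambda c: time_increment(processed, c, ideal_time))

# Three single-qubit gates keep Q0 busy; the pending CNOT needs Q0 and Q2 adjacent.
processed = [((0,), 1), ((0,), 1), ((0,), 1)]
ideal_time = circuit_time(processed + [((0, 2), 1)])   # time if connectivity were ignored
candidates = [
    {"name": "swap(Q0,Q1)", "swaps": [((0, 1), 3)], "remaining": [((1, 2), 1)]},
    {"name": "swap(Q1,Q2)", "swaps": [((1, 2), 3)], "remaining": [((0, 1), 1)]},
]
best = choose_candidate(processed, candidates, ideal_time)
```

Both candidates insert one swap, but only swap(Q1,Q2) runs on qubits that are idle while Q0 is busy, so it adds no cycles, whereas swap(Q0,Q1) must wait for Q0 and adds three.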

Fig. 6: Scheduling gates to create more slack: gate g1 can be moved forward such that swap(Q3, Q4) can absorb the longer slack on q3.

We use an example from Fig. 7 to show how criticality can play an important role in determining the overall circuit time. We use a five-qubit quantum machine, whose connectivity is shown in Fig. 7 (a). This example circuit consists of 4 CNOTs and 3 single-qubit gates, with g1 and g2 scheduled, and g3 and g4 not yet scheduled due to connectivity issues. It is crucial to note that one of the two unscheduled gates is on the critical path, while the other is not. The two gates cannot be resolved at the same time if each of them is to use only one swap, since qubit C is on the path from T1 to T2 and on the path from X1 to X2. Whether the critical gate is prioritized over the non-critical gate when using the hub qubit C for a swap makes a big difference in terms of circuit time. We show this discrepancy by illustrating two strategies and their resulting circuits.

• Strategy One - Prioritizing critical gates

Resolve the critical gate first. As shown in Fig. 7 (c), it is necessary to insert a swap before the critical gate. After the critical gate is resolved and scheduled, a second swap is inserted such that the non-critical gate can be resolved. The second swap can take advantage of the slack on its two logical qubits, as another logical qubit is processing three single-qubit gates during that time. This strategy results in a total circuit time of 10 cycles, assuming CNOT and single-qubit gates both have latencies of one cycle.

• Strategy Two - Not distinguishing critical gates from non-critical gates

Resolve the non-critical gate first. As shown in Fig. 7 (d), it is necessary to insert a swap before the non-critical gate and let it be scheduled. In the meantime, the critical gate has to wait, which elongates the critical path by the execution time of the non-critical gate and its swap. This is because, while the non-critical gate is being executed, the mapping that allows it must be kept, which delays all the remaining gates. In this case, it is not desirable to delay all the remaining gates, as they are on the critical path. Delaying gates that are critical has a more detrimental impact than delaying gates that are not on the critical path. After the non-critical gate is resolved, the fastest way to resolve the critical gate is to insert another swap before it. This strategy as a whole results in a total circuit time of 13 cycles, which is 30% more than Strategy One.

It can be seen from this example that when the resolution of non-critical gates is deferred, it is highly likely to overlap with the gates on the critical path and thus has less impact on the overall circuit execution.

IV. IMPLEMENTATION

Fig. 7: (a) Qubit connectivity for a five-qubit machine; (b) original circuit before qubit mapping; (c) Strategy One: resolving gates on the critical path first; (d) Strategy Two: resolving gates on the non-critical path first. [In the figure, logical qubits q1, q2, q3, q4, q5 are mapped to physical qubits X1, X2, C, T1, T2; the circuit in (c) takes 10 cycles and the circuit in (d) takes 13 cycles.]

Based on the design considerations in Section III, we implement a slack-aware qubit mapping framework called SlackQ.

A. Overview of SlackQ

Our algorithm is an iterative gate scheduler that dynamically resolves the connectivity issues encountered during the scheduling process. Initially, a dependency graph of the circuit is built. Then we traverse the dependency graph of the circuit and schedule the gates one by one. We keep a frontier set of gates ready to be scheduled. When resolving the connectivity issues, we invoke a priority-queue based searcher for swap candidates. It returns hardware-compliant candidates. Among these hardware-compliant candidates, the one that has the best slack utilization is chosen, and the scheduling process proceeds. We describe the algorithm below with respect to the pseudo-code shown in Algorithm 1:

Step One - Initialization

This step prepares for the search process. It builds the dependency graph of the circuit and finds the gates that do not depend on any other gates. It then places those gates into the frontier F. It also initializes the processed gate set P as the empty set, and the remaining gate set R as the entire circuit.

Step Two - Schedule Ready Gates

This step goes through the frontier list F. It finds all gates in F that can be scheduled immediately because they have no connectivity issues under the current mapping π, and schedules all of them. After scheduling a gate, it checks each descendant gate to see whether the descendant's other parent has also been scheduled. If so, the descendant gate's dependency is resolved, and the descendant gate is placed into F. This step is repeated until F contains no gate that can be scheduled with respect to the current qubit mapping.

Step Three - Resolve Qubit Mapping Conflicts

We go through the frontier F again, finding the gates whose dependencies are resolved but which are constrained by the current mapping and are on the critical path of the remaining circuit. These gates are put into a set called F_critical. We then run a priority-queue based searcher for hardware-compliant mappings. The mapping searcher returns a list of hardware-compliant mapping candidates, called M. Among these candidates, it finds the one (call it m) with the best slack utilization. Then we use the swap sequence associated with m to update the mapping π and add the swap sequence to the processed circuit P.

Step Four

Repeat Steps Two and Three until all gates in the circuit are scheduled. Return the transformed circuit.

In Sections IV-B to IV-D, we describe a few important aspects of this algorithm.
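Steps One through Four can be sketched as the following simplified Python loop. It is a hedged illustration rather than the SlackQ implementation: Step Three is replaced by a placeholder that moves one qubit a single hop along a shortest path in the coupling graph (no critical-gate selection and no slack-utilization ranking), and all names, the gate format, and the adjacency-list coupling graph are assumptions. The coupling graph is assumed to be connected.

```python
from collections import deque

def shortest_path(coupling, src, dst):
    """Breadth-first search for a path between two physical qubits."""
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in coupling[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def schedule(gates, coupling, mapping):
    """gates: list of (name, logical_qubits); mapping: logical -> physical qubit."""
    mapping = dict(mapping)
    # Step One: build dependencies, the frontier F, and the processed circuit P.
    last_on, pending, children = {}, {}, {}
    for idx, (_, qubits) in enumerate(gates):
        parents = {last_on[q] for q in qubits if q in last_on}
        pending[idx], children[idx] = len(parents), []
        for p in parents:
            children[p].append(idx)
        for q in qubits:
            last_on[q] = idx
    frontier = [i for i in pending if pending[i] == 0]
    processed = []

    def executable(i):
        qs = gates[i][1]
        return len(qs) == 1 or mapping[qs[1]] in coupling[mapping[qs[0]]]

    while frontier:
        # Step Two: schedule every gate that is ready under the current mapping.
        ready = [i for i in frontier if executable(i)]
        while ready:
            for i in ready:
                frontier.remove(i)
                name, qubits = gates[i]
                processed.append((name, tuple(mapping[q] for q in qubits)))
                for child in children[i]:
                    pending[child] -= 1
                    if pending[child] == 0:
                        frontier.append(child)
            ready = [i for i in frontier if executable(i)]
        if not frontier:
            break
        # Step Three (placeholder): one swap that brings the blocked qubits closer.
        a, b = gates[frontier[0]][1]
        p, q = shortest_path(coupling, mapping[a], mapping[b])[:2]
        occupant = {phys: log for log, phys in mapping.items()}
        if p in occupant:
            mapping[occupant[p]] = q
        if q in occupant:
            mapping[occupant[q]] = p
        processed.append(("swap", (p, q)))
    return processed, mapping   # Step Four: all gates scheduled

line = {0: [1], 1: [0, 2], 2: [1]}
circuit = [("h", ("a",)), ("cx1", ("a", "b")), ("cx2", ("a", "c"))]
result, final_mapping = schedule(circuit, line, {"a": 0, "b": 1, "c": 2})
```

On the three-qubit line, the last CNOT is blocked because `a` and `c` are not adjacent, so one swap on physical qubits 0 and 1 is emitted before it. A slack-aware mapper would instead generate several candidate swap sequences at this point and rank them.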

B. Initialization

Before calling the scheduler described in Algorithm 1, we initialize the frontier F, the processed circuit P, and the remaining circuit R. Initially, P is empty and R is the entire circuit. For F, we create a Directed Acyclic Graph (DAG) to represent the dependency between quantum operations. Fig. 8 (a) shows an example of a dependency graph for the circuit illustrated in Fig. 1 (b).

Algorithm 1: Dynamic Gate Scheduler

Input: Frontier F, initial mapping π, processed circuit P, remaining circuit R
Output: Transformed circuit T

while F not empty do
    E = getSchedulableGates(F, π);
    while E not empty do
        F.remove(E); P.add(E); R.remove(E);
        for g ∈ E do
            for d ∈ g.children do
                if d's dependency is resolved then
                    F.add(d);
                end
            end
        end
        E = getSchedulableGates(F, π);
    end
    F_critical = select_critical_gates(F);
    mapping_candidates = resolve_conflicts(F_critical, π);
    m = best_slack_utilization(mapping_candidates, P, R);
    π = update_mapping_with_swaps(m, π);
    P.add(m.swaps);
end
return P;

C. Choosing the Best Mapping Candidate

With multiple hardware-compliant mappings, it is necessary to determine the candidate that has the best slack utilization. The best slack utilization means the inserted swap sequence makes the best use of the slack currently existing in the circuit. To evaluate these mapping candidates, for each of them, we tentatively insert the associated swap sequence and monitor how the swap insertions affect the dependence graph and the critical path. We trace the nodes that are affected by the inserted swaps and detect how much their start and end times change.

We again use the example circuit and qubit coupling graph in Fig. 1 to show how our evaluation approach works. We first show the original dependency graph in Fig. 8 (a). The numbers on gates denote the start/end cycle of each gate. For instance, [0, 1] represents a start cycle of 0 and an end cycle of 1. For the sake of illustration, we assign a dummy gate node at the beginning of the circuit for each of the qubits q1 to q5. The dummy node starts at time 0 and takes 0 cycles. In this example, we have two possible mapping candidates whose swap sequences are to be inserted on different qubits. We need to choose the better one of the two mapping candidates.

The first mapping candidate, shown in Fig. 8 (b), inserts one swap between two gates of the original circuit, corresponding to the circuit in Fig. 1 (d). It affects gates g5 and g6, which are marked in red. For each affected node in the dependence graph, to calculate its earliest start time, one needs to check the end time of each of its parent nodes, choose the maximum one, and use it as the node's own start time. In this example, the added swap changes the start time of g5 as well as the start time of g6, and delays the entire circuit by 3 cycles. Here we assume each gate takes one cycle. We assume a swap is implemented using 3 CNOT gates and thus takes 3 cycles.

The second mapping candidate, shown in Fig. 8 (c), inserts one swap on logical qubits q3 and q5, placed right in front of the gate that needs it. It results in no changes to the start/end cycles of the entire circuit, since there is slack on the two qubits involved. Clearly, the mapping candidate illustrated in Fig. 8 (c) should be chosen.

We use an algorithm to systematically analyze the start/end time of each gate due to inserted swaps. The algorithm does not have to traverse the entire dependence graph. Instead, it only traces the affected gates in the original circuit to find the candidate that leads to the smallest increment in total circuit time. We also add an optimization to quickly terminate the tracing when a candidate is deemed hopeless. A candidate is deemed hopeless if one of the affected gates is on the critical path and the delay to that gate due to swaps already exceeds the smallest increment found for a previous candidate. In that case, the algorithm terminates the tracing and moves on to the next candidate. The algorithm is described in Algorithm 2.

D. Navigating the Candidate Search Space

We use a priority-queue-based searcher for qubit mapping candidates. The search space consists of state nodes that represent possible mappings from logical qubits to physical qubits. A mapping can be represented as $\pi : \{q_1, q_2, \ldots, q_n\} \to \{Q_1, Q_2, \ldots, Q_n\}$. Applying swaps on top of a mapping converts it into another mapping. Specifically, if we apply "swap $q_i$, $q_j$" to a mapping $\pi_{old}$ and create the resulting mapping $\pi_{new}$, we will have $\pi_{new}[q_i] = \pi_{old}[q_j]$ and $\pi_{new}[q_j] = \pi_{old}[q_i]$.

Algorithm 2: Find the best slack-utilizing mapping

Input: Mapping candidates M, dependency graph G, processed circuit P
Output: Best slack-utilizing mapping m_best

smallest_inc = ∞;
CP = getCriticalPath(G);
for m ∈ M do
    RG = getLastScheduledGateOnEachQubit(P);
    G' = G;
    G'.addGates(m.swaps);
    graph_updated = True;
    circuit_time = 0;
    while RG not empty do
        RG' = [];
        for g ∈ RG do
            g.updateTentativeStartAndEndCycle();
            delta = g.tentativeStart - g.originalStart;
            if g is on critical path & delta > smallest_inc then
                graph_updated = False;
                break the while loop;
            end
            if delta > 0 then
                RG'.add(g.children);
            end
            circuit_time = max(circuit_time, g.tentativeEnd);
        end
        RG = RG';
    end
    if (graph_updated == True & circuit_time > CP) then
        if ((circuit_time - CP) < smallest_inc) then
            smallest_inc = circuit_time - CP;
            m_best = m;
        end
    end
end
return m_best;

Given $F_{critical}$ and the current mapping $\pi$, the searcher explores the state space of all feasible mappings that satisfy $F_{critical}$. It picks a node to expand and enumerates all possible parallel one-step swaps as the node's successors. We use a priority queue that is similar to that in [16]. Unlike the work in [16], where the search stops when the first state node that resolves all connectivity conflicts is retrieved from the priority queue, our search stops after m expansions since the mapping candidate with minimal swap count is found, or when the gate count of the mapping candidate that is just retrieved has less than or equal to k times more gates than the minimal swap count. We set m = 20 and k = 2 such that the returned mapping candidate will have reasonable gate counts. After all mapping candidates have been retrieved, we rank them with respect to

the metric of best slack utilization discussed above.

[Fig. 8 graphic: three dependency graphs, (a), (b), and (c), over gates g1 to g6 on qubits q1 to q5, each qubit starting from a Dummy node. In (b) the inserted SWAP occupies cycles [3, 6] and the last gate ends at cycle 8; in (c) the inserted SWAP occupies cycles [0, 3] and the last gate ends at cycle 5.]

Fig. 8: (a) The generated dependency graph from the example of Fig. 1. The numbers displayed on each gate refer to the start/end cycle of this gate. (b) One mapping candidate whose inserted swap delays the start/end cycles of two later gates. (c) Another mapping candidate whose inserted swap does not affect the start/end cycles of the entire circuit.

V. EVALUATION

In this section, we evaluate our slack-aware swap insertion scheme (SlackQ) and compare it with two state-of-the-art qubit mappers, by [16] and [13], respectively.

A. Experiment Setup

Benchmark. We use 106 benchmarks from RevLib [1], IBM Qiskit [2], and ScaffCC [3]. RevLib comprises a collection of benchmarks in the domain of reversible and quantum circuit design. Qiskit is a programming framework for quantum computing provided by IBM. ScaffCC is a compilation framework for the Scaffold quantum programming language. These benchmarks range in functionality from implementing ALU logic, comparing inputs with constant values, and ternary counters to classic quantum algorithms such as the Quantum Fourier Transform (QFT) and the Ising model.

Baseline. We compare our work with two best-known qubit mapping solutions: the mapper from [16] (denoted as Zulehner) and the Sabre qubit mapper from [13] (denoted as Sabre). We also compare our results with IBM's stochastic mapper in Qiskit [2]. Since IBM's Qiskit mapper is significantly worse in terms of circuit time than all other mappers we have evaluated, we do not show the results. The performance of the Qiskit mapper is also noted in the work by [16].

Metrics. We compare the execution time of the transformed circuits generated by different qubit mapping strategies. It is worth mentioning that our approach can take any gate latency as an input parameter and generate transformed circuits accordingly. However, to make evaluation results as close to real machines as possible, we use the results from the studies in [22], [23]. In these studies, different types of quantum architectures are investigated, and the studies reveal that two-qubit gates usually take around twice as much time as single-qubit gates. Hence we assume single-qubit gates take 1 cycle and two-qubit CNOT gates take 2 cycles in our experiments. The time is reported as the total number of executed cycles.
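As an illustration of how a cycle count can be derived from such a latency model, the following Python sketch schedules a gate list as early as possible and reports the total number of cycles. It is only a sketch under stated assumptions: the `(name, qubits)` gate format and the names `gate_cycles` and `circuit_cycles` are hypothetical, and a swap is charged as three sequential CNOTs, following the earlier statement that a swap is implemented using 3 CNOT gates.

```python
# Latency model from the experiment setup: single-qubit gates take 1 cycle,
# two-qubit CNOT gates take 2 cycles. The dictionary layout is hypothetical.
LATENCY = {"single": 1, "cx": 2}


def gate_cycles(name, latency=LATENCY):
    """Cycles for one gate; a swap is assumed to be three sequential CNOTs."""
    if name == "swap":
        return 3 * latency["cx"]
    if name == "cx":
        return latency["cx"]
    return latency["single"]


def circuit_cycles(gates, latency=LATENCY):
    """Total cycles when every gate starts as soon as all its qubits are free.

    `gates` is a list of (name, qubits) tuples in program order.
    """
    ready = {}  # qubit -> cycle at which it becomes free
    for name, qubits in gates:
        start = max((ready.get(q, 0) for q in qubits), default=0)
        end = start + gate_cycles(name, latency)
        for q in qubits:
            ready[q] = end
    return max(ready.values(), default=0)


# Hypothetical five-gate circuit on four qubits.
EXAMPLE = [
    ("h", (0,)),
    ("cx", (0, 1)),
    ("x", (2,)),
    ("cx", (1, 2)),
    ("swap", (0, 3)),
]

if __name__ == "__main__":
    print(circuit_cycles(EXAMPLE))
```

Because the latencies are passed in as a parameter, the same function can be reused with a different model, which mirrors the remark that any gate latency can be supplied as input.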

Platform. We use IBM's 20-qubit Q20 Tokyo architecture [13] as the underlying quantum hardware. The qubit mapping approach is implemented in C++ and executed on an Intel 2.4 GHz Core i5 machine with 8 GB of 1600 MHz DDR3 memory.

B. Experiment Analysis

We categorize the 106 benchmarks into four categories. Benchmarks in the first category each have fewer than 200 gates, and we denote them as mini benchmarks. There are 22 mini benchmarks. The second category has benchmarks with 200 to 1,000 gates. We name these small benchmarks. There are 39 small benchmarks. The third category has benchmarks with 1,000 to 10,000 gates. We name these medium benchmarks, and there are 21 benchmarks in this category. The fourth category has benchmarks with 10,000 to 200,000 gates. We refer to these as large benchmarks, and there are 24 benchmarks in this category. The results for mini, small, medium, and large benchmarks are presented in Fig. 9, Fig. 10, Fig. 11, and Fig. 12, respectively.
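The size classes above can be written as a small lookup. This sketch assumes half-open ranges (a benchmark with exactly 1,000 gates counts as medium), which matches the "<" bounds in the figure captions but is not stated explicitly in the text; the names `BOUNDS`, `NAMES`, `COUNTS`, and `category` are hypothetical, and the 200,000-gate upper limit is not enforced.

```python
from bisect import bisect_right

# Lower bounds of the small, medium, and large classes (gate counts).
BOUNDS = [200, 1_000, 10_000]
NAMES = ["mini", "small", "medium", "large"]

# Number of benchmarks reported per class in the text.
COUNTS = {"mini": 22, "small": 39, "medium": 21, "large": 24}


def category(gate_count):
    """Map a gate count to its size class, assuming half-open ranges [lo, hi)."""
    return NAMES[bisect_right(BOUNDS, gate_count)]


if __name__ == "__main__":
    for n in (150, 200, 5_000, 150_000):
        print(n, category(n))
```

The per-class counts sum to the 106 benchmarks used in the evaluation.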

Fig. 9: Speedup for Mini Benchmarks (< 200 gates)

[Figure legend: Speedup Over Zulehner, Speedup Over Sabre]

Fig. 10: Speedup for Small Benchmarks (< 1000 gates)

It can be observed from the results that as the problem size scales, the performance improvement brought by SlackQ increases. For most benchmarks in the mini and small categories, the speedup is between 1.1X and 1.5X. However, for the medium and large categories, the speedup for most benchmarks is around or above 1.5X. The average speedup for mini benchmarks is 1.45X, and for small, medium, and large benchmarks, the average speedup becomes 1.55X, 1.70X, and 1.86X, respectively. The results show that our approach works well in general, and in particular for larger benchmarks.

Fig. 11: Speedup for Medium Benchmarks (< 10000 gates)

Fig. 12: Speedup for Large Benchmarks (< 200000 gates)

There are two baselines we compare against: the Zulehner approach and the Sabre approach. For mini and small benchmarks, the Zulehner approach does not seem to perform as well as the Sabre approach. This can be seen from the fact that the relative speedup of SlackQ over Zulehner is usually larger than that of SlackQ over Sabre. However, the Sabre approach performs worse than the Zulehner approach for medium and large benchmarks. Zulehner and Sabre thus perform well in different scenarios when compared against each other. Regardless, our approach SlackQ outperforms both of them.

VI. CONCLUSION

The physical layout of contemporary quantum devices imposes limitations for mapping a high-level quantum program to the hardware. It is critical to develop an efficient qubit mapper in the NISQ era. Existing studies aim to reduce the gate count but are oblivious to the depth of the transformed circuit. This paper presents the design of the first time-efficient slack-aware swap insertion scheme. Experiment results show that our proposed solution generates hardware-compliant circuits with faster execution time compared with state-of-the-art mapping schemes.

REFERENCES

[1] R. Wille, D. Große, L. Teuber, G. W. Dueck, and R. Drechsler, “Revlib: An online resource for reversible functions and reversible circuits,” in . IEEE, 2008, pp. 220–225.

[2] QISKit: Open Source Quantum Information Science Kit, https://qiskit.org/.

[3] A. JavadiAbhari, S. Patil, D. Kudrow, J. Heckey, A. Lvov, F. T. Chong, and M. Martonosi, “Scaffcc: a framework for compilation and analysis of quantum computing programs,” in Proceedings of the 11th ACM Conference on Computing Frontiers. ACM, 2014, p. 1.

[4] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. S. L. Brandao, D. A. Buell, B. Burkett, Y. Chen, Z. Chen, B. Chiaro, R. Collins, W. Courtney, A. Dunsworth, E. Farhi, B. Foxen, A. Fowler, C. Gidney, M. Giustina, R. Graff, K. Guerin, S. Habegger, M. P. Harrigan, M. J. Hartmann, A. Ho, M. Hoffmann, T. Huang, T. S. Humble, S. V. Isakov, E. Jeffrey, Z. Jiang, D. Kafri, K. Kechedzhi, J. Kelly, P. V. Klimov, S. Knysh, A. Korotkov, F. Kostritsa, D. Landhuis, M. Lindmark, E. Lucero, D. Lyakh, S. Mandrà, J. R. McClean, M. McEwen, A. Megrant, X. Mi, K. Michielsen, M. Mohseni, J. Mutus, O. Naaman, M. Neeley, C. Neill, M. Y. Niu, E. Ostby, A. Petukhov, J. C. Platt, C. Quintana, E. G. Rieffel, P. Roushan, N. C. Rubin, D. Sank, K. J. Satzinger, V. Smelyanskiy, K. J. Sung, M. D. Trevithick, A. Vainsencher, B. Villalonga, T. White, Z. J. Yao, P. Yeh, A. Zalcman, H. Neven, and J. M. Martinis, “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.

[5] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proceedings 35th annual symposium on foundations of computer science. IEEE, 1994, pp. 124–134.

[6] L. K. Grover, “A fast quantum mechanical algorithm for database search,” in Proceedings of the Twenty-eighth Annual ACM Symposium on Theory of Computing, ser. STOC '96. New York, NY, USA: ACM, 1996, pp. 212–219. [Online]. Available: http://doi.acm.org/10.1145/237814.237866

[7] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O'Brien, “A variational eigenvalue solver on a photonic quantum processor,” in Nature Communications

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019, pp. 1001–1014.

[14] R. Wille, L. Burgholzer, and A. Zulehner, “Mapping quantum circuits to ibm qx architectures using the minimal number of swap and h operations,” in Proceedings of the 56th Annual Design Automation Conference 2019. ACM, 2019, p. 142.

[15] A. Zulehner, S. Gasser, and R. Wille, “Exact global reordering for nearest neighbor quantum circuits using A∗,” in International Conference on Reversible Computation. Springer, 2017, pp. 185–201.

[16] A. Zulehner, A. Paler, and R. Wille, “Efficient mapping of quantum circuits to the ibm qx architectures,” in . IEEE, 2018, pp. 1135–1138.

[17] M. Y. Siraichi, V. F. d. Santos, S. Collange, and F. M. Q. Pereira, “Qubit allocation,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization. ACM, 2018, pp. 113–125.

[18] A. W. Cross, L. S. Bishop, S. Sheldon, P. D. Nation, and J. M. Gambetta, “Validating quantum computers using randomized model circuits,” Physical Review A, vol. 100, no. 3, Sep 2019. [Online]. Available: http://dx.doi.org/10.1103/PhysRevA.100.032328

[19] M. A. Nielsen and I. Chuang, “Quantum computation and quantum information,” 2002.

[20] M. Amy, D. Maslov, M. Mosca, and M. Roetteler, “A meet-in-the-middle algorithm for fast synthesis of depth-optimal quantum circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 6, pp. 818–830, Jun 2013. [Online]. Available: http://dx.doi.org/10.1109/TCAD.2013.2244643

[21] A. W. Cross, L. S. Bishop, J. A. Smolin, and J. M. Gambetta, “Open quantum assembly language,” arXiv preprint arXiv:1707.03429, 2017.

[22] P. Murali, J. M. Baker, A. Javadi-Abhari, F. T. Chong, and M. Martonosi, “Noise-adaptive compiler mappings for noisy intermediate-scale quantum computers,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '19. New York, NY, USA: ACM, 2019, pp. 1015–1029. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304075

[23] N. M. Linke, D. Maslov, M. Roetteler, S. Debnath, C. Figgatt, K. A. Landsman, K. Wright, and C. Monroe, “Experimental comparison of two quantum computing architectures,”