TIGER: Topology-aware Assignment using Ising machines
Application to Classical Algorithm Tasks and Quantum Circuit Gates

Abstract

Optimally mapping a parallel application to compute and communication resources is increasingly important as both system size and heterogeneity increase. A similar mapping problem exists in gate-based quantum computing, where the objective is to map circuit gates to physical qubits in a topology-aware fashion. This is an NP-complete graph isomorphism problem, and existing task assignment approaches are either heuristic or based on physical optimization algorithms, providing different speed and solution quality trade-offs. Ising machines such as quantum and digital annealers have recently become available and offer an alternative hardware solution to this type of optimization problem. In this paper, we propose an algorithm that allows solving the topology-aware assignment problem using Ising machines. We demonstrate the algorithm on two use cases, namely classical task scheduling and quantum circuit gate scheduling. TIGER, the topology-aware task/gate assignment mapper tool, implements our proposed algorithms and automatically integrates them into the quantum software environment. To address the limitations of physical solvers, we propose and implement a domain-specific partition strategy that allows solving larger-scale problems and a weight optimization algorithm that allows tuning Ising model parameters to achieve better results. We use D-Wave's quantum annealer to demonstrate our algorithm and evaluate the proposed tool flow in terms of performance, partition efficiency, and solution quality. Results show significant speed-up compared to classical solutions, better scalability, and higher solution quality when using TIGER together with the proposed partition method. TIGER reduces the data movement cost by 68% on average for quantum circuit assignment compared to the IBM QX optimizer.



Anastasiia Butko · Ilyas Turimbetov · George Michelogiannakis · David Donofrio · Didem Unat · John Shalf

September 20, 2020

A. Butko · G. Michelogiannakis · D. Donofrio · J. Shalf
Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
E-mail: {abutko, mihelog, ddonofrio, jshalf}@lbl.gov

I. Turimbetov · D. Unat
Koç University, Istanbul 34450, Turkey
E-mail: {iturimbetov18, dunat}@ku.edu.tr

Keywords

Topology-aware task assignment · gate scheduling optimization · Ising machine · quantum annealing.

The task assignment problem aims to maximize application performance by balancing computational load among multiple, often heterogeneous, processing units while reducing compute overhead. Bokhari [1] showed that the task assignment problem is equivalent to a graph isomorphism problem, which is known to be NP-complete [20,13]. Therefore, many solvers for this problem are either heuristic [31], inevitably trading off solution quality for computation speed, or based on physical optimization algorithms, such as simulated annealing [34], genetic techniques [25], and others. In addition, solvers can have different optimization metrics that are often contradictory, such as computational load, communication cost, or a weighted combination [29,4].

Scheduling quantum gates onto physical qubits is a similarly challenging problem, given the complexity and variety of quantum operations and the physical restrictions of each quantum chip. To keep operations efficient, quantum gates should be scheduled on quantum hardware so as to minimize the number of operations and maximize quantum circuit fidelity (how much quantum information is preserved), while taking into account the connectivity between physical qubits [10]. Consequently, many mapping algorithms scale poorly in terms of runtime, memory usage, and the quality of their generated solutions [21]. In addition, the quality of their solutions compared to the theoretical optimum is unknown [35]. These challenges indicate that gate assignment may stand in the way of high-quality solutions on future quantum accelerators with more physical qubits and more complex connectivity.

While genetic algorithms and simulated annealing are often considered best practices, recent Ising machines offer an alternative hardware solution for a set of optimization problems, such as task scheduling.
These Ising machines can be implemented using different technologies and exploit various physical effects. Examples include coherent Ising machines [37], Fujitsu's digital annealer [9], and quantum annealers designed by D-Wave Systems Inc. [16]. Several studies on quantum annealers [22,19] explore their capabilities and limitations, projecting the potential of these machines for future use.

Despite the potential benefits offered by quantum annealers and a growing interest in alternative solutions, the practical applicability of annealing machines remains highly questionable. One reason is the physical limitations of current machines, namely the relatively small size of the chip and the poor connectivity between qubits [19]. Problem sizes demonstrated in comparison studies are usually not competitive with those handled by classical solvers. Therefore, effective problem partitioning and post-processing are required to keep exploiting quantum solver capabilities while solutions to the physical limitations are sought [38]. As a result, most near-term quantum annealing-based approaches are classical-quantum hybrids.

Another obstacle to widespread quantum annealer adoption is programming complexity. Its programming model is based on the Quadratic Unconstrained Binary Optimization (QUBO) [12] model, which differs from conventional programming and requires special approaches. The highest level of abstraction at which users can program the D-Wave is a "virtual" QUBO, where "virtual" means that the compiler takes care of mapping and routing the problem while taking into account device connectivity. Transforming a problem into QUBO format is not a trivial task; higher-level tools as well as efficient algorithms are typically required [27].

In this work, we present the Topology-aware task assIGnment mappER (TIGER) to solve the assignment problem using Ising machines.
Namely, our contributions are:
– We develop an algorithm to assign a Task-Communication Graph (TCG) to the architecture units, minimizing the required data movement and maximizing performance. The assignment problem is expressed in the QUBO format to be used by an Ising machine.
– We develop an algorithm to assign a Quantum Circuit Graph (QCG) to the qubits, minimizing data movement (the number of SWAP operations) and maximizing fidelity. The assignment problem is expressed in the QUBO format to be used by an Ising machine.
– We develop a domain-specific QUBO partitioning algorithm (sub-QUBO) based on the graph dependency levels to overcome the physical limitations of existing quantum annealers and accelerate the solution search.
– We develop a weight optimization algorithm (WOA) to tune Ising equation parameters in order to prioritize target metrics and adjust them to obtain better solutions.
– We implement these algorithms in the TIGER tool. TIGER is written in Python and uses the NetworkX package [7] to create and manipulate TCG/QCG and ARC structures.
– We integrate TIGER into the D-Wave tool flow by supporting the qbsolv qubo [2] and qmasm [26] formats and by creating a feedback loop from D-Wave to TIGER in order to evaluate the solution for further optimizations.
– We evaluate the proposed algorithms and their implementation using the D-Wave quantum annealer. We compare the D-Wave solver performance and the quality of the task assignment (solution) to the classical TABU-search algorithm. We evaluate the quality of the quantum circuit assignment in terms of circuit fidelity using real IBM systems [15] and compare it against the IBM QX gate optimizer. Our results show that TIGER with the D-Wave annealer provides up to 8% computation cost improvement and up to 25% communication cost improvement compared to the classical TABU-search solver when assigning a TCG.
It reduces the data movement cost by 68% on average for quantum circuit assignment compared to the IBM QX optimizer [15].

Given the relatively small size of the evaluated quantum annealer, we leave the discussion of the general competitiveness of quantum annealers against classical computing out of the scope of this paper. Our results aim to provide useful insights into the entire tool flow, including classical decomposition, domain-specific partitioning, and QUBO solvers. Last but not least, we invite the community to use TIGER and to contribute back to aid the tool's growth.


Latest updates, documentation, and support can be found online at https://github.com/lbnlcomputerarch/tiger. The rest of the paper is organized as follows: Section 2 provides background on existing Ising machines. Section 3 and Section 4 describe the proposed task assignment and quantum gate assignment mapping approaches, respectively. Section 5 describes the TIGER tool implementation as well as its integration into the complete tool flow with the D-Wave programming environment. Section 6 shows performance, quality, sensitivity, and scalability evaluation results. Section 7 concludes the work.

Ising machines are special-purpose processors that solve the Ising model, an intensely studied NP-complete problem describing a system of interacting classical spins [5]. An Ising model is a mathematical model composed of a large lattice of sites, where each site can be in one of two states. The model can be used to describe how changes to parameters (such as connectivity and desired operations) affect the global state of the system. Ising models have been used to express and perform computation with different materials, such as lasers and magnets, and are also the basis of several quantum accelerators because they are a natural fit for expressing a graph of interconnected qubits.

Quantum annealing [18] is a metaheuristic technique for solving local search problems, such as finding the global minimum or maximum in a discrete search space. Quantum annealing offers potential benefits compared to popular heuristic algorithms through the quantum tunneling effect. This effect allows the system to penetrate energy barriers, escaping from local minima and therefore finding better solutions to the original optimization problem.

A quantum annealing machine, or quantum annealer, is a hardware implementation of the adiabatic quantum computing algorithm. Quantum annealers operate on a set of qubits. A qubit is a two-state quantum-mechanical system that can be in the states |0⟩ and |1⟩ or in a linear superposition of these "basis states". This feature is the key source of the power of quantum machines: n qubits can be in an arbitrary superposition of up to 2^n different states simultaneously. Another inherent quantum property of qubits is quantum entanglement, where a group of qubits is coupled in such a way that the state of each qubit cannot be described separately, only as part of the whole system state [24].

Quantum annealers provided by D-Wave Systems Inc. have been commercially available since 2011 [16]. D-Wave quantum chips are implemented using superconducting technology and require an extremely isolated environment with a temperature close to absolute zero. A closed-cycle dilution refrigerator cools the processor down to 15 mK. Therefore, while the actual quantum chip is the size of a stamp, the physical volume of the whole D-Wave system reaches 20 m³. However, D-Wave machines consume less than 25 kW of power, mostly for cooling and front-end servers [17]. In around 10 years, quantum annealing chips have reached on the order of 10^3 qubits, promising significant performance improvement for certain computing problems in the near future.
Physically, qubits are connected to each other using the so-called Chimera topology. The smallest Chimera unit is a complete bipartite graph of eight vertices, each of which is connected to its four neighbours inside the unit and to two neighbours outside the unit.

In [6], the authors compare the performance of a physical quantum annealer (the D-Wave 2X) to simulated annealing and quantum Monte Carlo methods executed on a classical processor. Furthermore, the authors of [22] extend the Google Inc. studies by comparing quantum annealing to state-of-the-art optimization methods and introducing more sophisticated assessment metrics. Their work considers four categories of optimization methods: sequential methods, which include quantum annealing, simulated annealing, and quantum Monte Carlo; tailored methods, which solve simplified optimization problems; and non-tailored methods, which are generic and thus represent the state of the art. The authors conclude that physical quantum annealing scales better than the other sequential optimization methods, but is outperformed by tailored as well as non-tailored state-of-the-art methods. They also emphasize the importance of determining the application domain in which quantum annealing maximizes its benefits, which has yet to be defined. Finally, King et al. [19] introduce a problem class that can maximize the usefulness of the quantum tunneling effect. The authors again compare quantum annealers to classical solvers and demonstrate a three to four orders of magnitude performance speed-up in favor of quantum annealing.

Several studies demonstrate the use of quantum annealing for task scheduling. In [32], the authors introduce a hybrid quantum-classical approach to solving scheduling problems. Their framework integrates quantum annealing with classical computing into a guided tree search. Classical algorithms manage a global tree search and communicate the node search in QUBO format to the quantum annealer. The authors test the proposed framework on three scheduling problems, i.e. graph coloring, Mars lander task scheduling, and airport runway scheduling. Results show that the quantum annealer's output can effectively prune and guide the search process. The authors motivate their work by the need to expand the capabilities of current quantum annealers and do not expect quantum annealers to be competitive against classical computers in the near term.

Fig. 1: Task Communication Graph (TCG) assignment on a heterogeneous multi-PU system: problem mapping on QUBO. Panels: (a) QAP mapping on QUBO (4 tasks, 4 processing units); (b) TCG mapping on QUBO (5 tasks, 2 PUs); (c) TCG partitioning and mapping on QUBO (tasks 0 to 10 grouped into sub-graphs SG1 to SG3 and mapped to Sub-QUBO 1 to 3, with input edges).

In our work, we address a different scheduling problem, i.e. topology-aware assignment. The proposed TIGER framework extends existing software environments by automatically generating and dynamically adjusting QUBO files. We evaluate the tool flow in terms of quantum solver performance and the quality of task/gate assignment, and discuss the potential scalability of near-term machines.

2.1 Problem formulation and programming

Quantum annealers minimize the QUBO problem described by Equation 1. The equation gives the energy of a system of N interacting spins, i.e. qubits, and the annealer evolves a time-dependent Hamiltonian [14] toward the low-energy states of this function. In Equation 1, q_i represents qubits that take values from the set {0, 1}, h_i is a weight coefficient associated with each qubit, J_ij denotes the strength of the coupling between two qubits q_i and q_j, and N is the number of qubits.

$$E(q_1, \ldots, q_N) = \sum_{i=1}^{N} h_i \, q_i + \sum_{i<j}^{N} J_{ij} \, q_i \, q_j \qquad (1)$$
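Equation 1 can be evaluated directly for any candidate bit vector. The short Python sketch below is illustrative only and is not part of TIGER: it stores h_i and the J_ij with i < j in dictionaries and finds the minimum-energy vector by exhaustive search, which is feasible only for a handful of qubits (an annealer samples the same energy landscape physically). The two-qubit example rewards setting each qubit (negative h_i) but penalizes setting both (positive J_01), the pattern later used for assignment constraints.

```python
import itertools


def qubo_energy(h, J, q):
    """Equation 1: sum_i h_i q_i + sum_{i<j} J_ij q_i q_j for a binary vector q."""
    linear = sum(h_i * q[i] for i, h_i in h.items())
    quadratic = sum(J_ij * q[i] * q[j] for (i, j), J_ij in J.items())
    return linear + quadratic


def brute_force_minimum(h, J, n):
    """Try all 2**n bit vectors and keep the first one with the lowest energy."""
    best_q, best_e = None, float("inf")
    for q in itertools.product((0, 1), repeat=n):
        e = qubo_energy(h, J, q)
        if e < best_e:
            best_q, best_e = q, e
    return best_q, best_e


# Setting either qubit lowers the energy, but setting both is penalized.
h = {0: -1.0, 1: -1.0}
J = {(0, 1): 2.0}
q, e = brute_force_minimum(h, J, n=2)
print(q, e)  # (0, 1) -1.0
```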


(Fig. 2 legend: computation task placement; source unit; destination unit; inter-unit communication; local communication; no input edges.)

Fig. 2: Binary solution interpretation: computation task assignment and communication impact.

Q represents the permutation matrix X, where each qubit defines the assignment of a task to a specific PU, similar to x_ij above. An x_ij value of 1 represents that task i was assigned to PU j. A weight coefficient h_i (not shown) represents the computational cost of the assignment. Because the solver searches for the minimum-energy state, purely positive costs would make the all-zero answer the lowest-energy solution; we therefore transform positive computation costs into negative numbers to prevent the solver from giving all-zero answers. To respect assignment constraints, such as placing each task on exactly one PU, we use qubit couplings and give them high penalty values such that J_ij ≫ |h_i|. For example, to prevent task 0 from being placed on multiple PUs, we couple all six pairs of the four qubits that represent task 0 (one qubit per PU). Therefore, if two of these qubits are set, i.e. the same task is assigned to two PUs, the large penalty value makes the overall solution ineligible.

3.2 Task-communication graph assignment

Applications can be represented as a weighted directed acyclic graph, usually referred to as a Task Communication Graph (TCG). A TCG is defined as a tuple G = (V, E), where V = (v_i) is a set of weighted vertices, with the weight representing task computational cost, and E = (e_ij) is a set of weighted edges, with the weight representing inter-task communication cost. An example TCG is shown in the upper part of Figure 1(b).

Mapping such a TCG into QUBO differs from the previously shown LAP in three aspects. First, a TCG includes not only computation cost, but also inter-task communication cost expressed with graph edges. Second, not all tasks are assigned to PUs within the same time frame: a TCG is divided into multiple dependency levels, each of which represents a LAP. Dependency levels (groups) are shown with red dashed lines. Third, within each dependency level, the number of independent tasks can differ from the number of available PUs.
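A minimal sketch of this encoding may help. The Python code below is an illustration under stated assumptions, not TIGER's implementation, and all function names are hypothetical. Variables are (task, PU) pairs. The text only says that costs are converted to negative numbers, so the sketch assumes a constant `offset` larger than every cost and uses `cost - offset`, which keeps cheaper placements more negative. A large `penalty` plays the role of J_ij ≫ |h_i| for the one-task-one-PU constraint (the negative linear terms push each task to be placed at least once). Covering the first two TCG-specific aspects listed above, tasks in the same dependency level are penalized for sharing a PU, and each communication edge couples every (source PU, destination PU) pair with its cost scaled by an assumed hop distance. Exhaustive search stands in for the annealer.

```python
import itertools


def build_tcg_qubo(comp, comm, dist, levels, offset, penalty, comm_weight=1.0):
    """Return a QUBO dict {(u, v): weight} over variables u = (task, pu).

    comp[t][p]    computation cost of task t on PU p (positive)
    comm[(s, d)]  communication cost of TCG edge s -> d (positive)
    dist[p][r]    hop distance between PUs p and r (0 when p == r)
    levels        dependency levels, each a list of task IDs
    offset        ASSUMED shift making costs negative; must exceed every cost
    penalty       constraint weight, much larger than any |h_i|
    """
    Q = {}

    def add(u, v, w):
        key = (u, v) if u <= v else (v, u)  # store each unordered pair once
        Q[key] = Q.get(key, 0.0) + w

    pus = range(len(dist))
    for t, row in comp.items():
        for p in pus:
            add((t, p), (t, p), row[p] - offset)  # h_i: cheaper -> more negative
        for p, r in itertools.combinations(pus, 2):
            add((t, p), (t, r), penalty)  # a task may not occupy two PUs
    for level in levels:
        for a, b in itertools.combinations(level, 2):
            for p in pus:
                add((a, p), (b, p), penalty)  # same level: no shared PU
    for (s, d), c in comm.items():
        for p in pus:
            for r in pus:
                add((s, p), (d, r), comm_weight * (c * dist[p][r] - offset))
    return Q


def energy(Q, x):
    """Equation 1 in dict form; diagonal keys (u, u) act as linear terms."""
    return sum(w * x[u] * x[v] for (u, v), w in Q.items())


def solve_exhaustively(Q):
    """Stand-in for the annealer: enumerate every bit assignment (tiny problems only)."""
    variables = sorted({v for pair in Q for v in pair})
    best_e, best_x = float("inf"), None
    for bits in itertools.product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        e = energy(Q, x)
        if e < best_e:
            best_e, best_x = e, x
    placement = {t: p for (t, p), bit in best_x.items() if bit}
    return placement, best_e


# Hypothetical TCG: task 0 feeds tasks 1 and 2, which share a dependency level.
Q = build_tcg_qubo(
    comp={0: [1, 3], 1: [2, 2], 2: [3, 3]},
    comm={(0, 1): 4, (0, 2): 1},
    dist=[[0, 1], [1, 0]],
    levels=[[0], [1, 2]],
    offset=10,
    penalty=100,
)
placement, e = solve_exhaustively(Q)
print(placement, e)  # {0: 0, 1: 0, 2: 1} -43.0
```

In this toy instance the heavy edge 0 -> 1 becomes local communication on PU 0, while the same-level penalty pushes task 2 to the other PU.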
The QUBO mapping transformation respects each of the three aspects above.

Communication edges.

Each communication edge is included in the QUBO through qubit couplings, with the communication cost represented by the coupling strength. The total end-to-end cost is calculated based on the weight of each edge in the communication path. If both source and destination tasks are assigned to the same PU, the communication cost is zero. This is the most favourable case if the objective is to minimize data movement. For the example in Figure 1(b), to define the edge between task 0 and task 1, we couple the two qubit pairs that place task 0 and task 1 on different PUs with the associated topology-aware communication cost, and the two qubit pairs that place them on the same PU with zero communication cost. Here, cost values are converted to negative numbers, similar to computation cost values. The relative priority of communication and computation costs can be set by adding a weight factor to bias the solver.

Dependency levels.

Because of dependencies, only a certain number of tasks can be assigned to PUs in parallel. This relaxes the second assignment constraint, which states that no more than one task can be placed on a PU: the constraint applies only to tasks belonging to the same dependency group. For the example shown in Figure 1(b), task 0 is separated from task 1 and task 2 by a red dashed line. Thus, we couple with a high penalty cost only the qubit pairs that would place task 1 and task 2 on the same PU; without this penalty, such a placement would be a valid solution for the solver. The first assignment constraint, which states that a task cannot be placed on multiple PUs at the same time, remains unchanged.

Level adjustments.

When the number of parallel tasks exceeds the number of available computing resources, an important decision has to be made about which set of tasks to prioritize. This decision is reflected in the qubit matrix, i.e. in the order of the columns associated with specific tasks and in the corresponding assignment constraint couplings. Multiple approaches exist in the field, but their study is out of the scope of this paper. Here, we apply a simple cut based on increasing task ID. Figure 1(b) illustrates the case in which task 4 belongs to dependency level 1 but is moved to the next level. If there are no available slots in the following group of tasks, an additional level is created.

3.3 Domain-specific TCG partition

Given the number of logical qubits, together with the potential number of couplings and constraints per problem, we quickly exhaust the physical capabilities of quantum machines. Therefore, an intelligent problem partition is required. There has been extensive research on graph partitioning [30]. In this context, we apply the method shown in Figure 1(c). This method divides a TCG into sub-graphs (SGs) based on dependency levels. The example in Figure 1(c) illustrates partitioning with two dependency levels each for sub-QUBO 1 and sub-QUBO 2, and three dependency levels for sub-QUBO 3. The finest granularity corresponds to one dependency level per sub-QUBO; further division of the problem would distort the concept of optimal parallel task assignment. The weakness of such partitioning is that only communication edges inside an SG are considered. Thus, multiple communication edges are excluded from the problem and are not represented in the qubit matrix. Excluded edges are labelled with red crosses in Figure 1(c). This may have a significant impact on the quality of the provided solution, especially for communication-intensive applications.

Part of the novelty of our work is improving the partition by applying an iterative, previous-placement-dependent approach, which takes advantage of dependency level-based partitioning.
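The level adjustment and the level-based partition can be sketched with NetworkX. This is an illustrative reading of the text, not TIGER's code, and the function names are hypothetical. Assumptions: dependency levels are computed as NetworkX topological generations (available in NetworkX 2.6 and later); overflow tasks, chosen by the ID-based cut, join the next level only if it has free slots and contains no task that depends on them (a check added here to keep dependencies valid), otherwise an additional level is created; each sub-graph reports its internal edges and the source tasks of its excluded input edges, i.e. the tasks whose known placements could become virtual qubits.

```python
import networkx as nx


def dependency_levels(tcg):
    """Level of a task = length of the longest dependency chain leading to it."""
    return [sorted(gen) for gen in nx.topological_generations(tcg)]


def adjust_levels(levels, num_pus, tcg):
    """Cap each level at num_pus tasks, keeping the lowest task IDs."""
    pending = [list(level) for level in levels]
    adjusted = []
    i = 0
    while i < len(pending):
        level = sorted(pending[i])
        keep, overflow = level[:num_pus], level[num_pus:]
        adjusted.append(keep)
        if overflow:
            nxt = pending[i + 1] if i + 1 < len(pending) else None
            fits = (
                nxt is not None
                and len(nxt) + len(overflow) <= num_pus
                and not any(nx.has_path(tcg, a, b) for a in overflow for b in nxt)
            )
            if fits:
                pending[i + 1] = nxt + overflow  # fill free slots of the next level
            else:
                pending.insert(i + 1, overflow)  # create an additional level
        i += 1
    return adjusted


def partition(levels, levels_per_sg, tcg):
    """Group consecutive levels into sub-graphs, one sub-QUBO each."""
    parts = []
    for start in range(0, len(levels), levels_per_sg):
        tasks = [t for level in levels[start:start + levels_per_sg] for t in level]
        members = set(tasks)
        inner = [(u, v) for u, v in tcg.edges if u in members and v in members]
        excluded_inputs = [(u, v) for u, v in tcg.edges if u not in members and v in members]
        parts.append({
            "tasks": tasks,
            "edges": inner,
            "virtual_sources": sorted({u for u, _ in excluded_inputs}),
        })
    return parts


# Hypothetical TCG with three parallel tasks in level 1 and only two PUs.
tcg = nx.DiGraph([(0, 1), (0, 2), (0, 3), (1, 4), (2, 4)])
levels = adjust_levels(dependency_levels(tcg), num_pus=2, tcg=tcg)
print(levels)  # [[0], [1, 2], [3, 4]]
for part in partition(levels, levels_per_sg=2, tcg=tcg):
    print(part)
```

As in the task 4 example of Figure 1(b), task 3 is cut from level 1 and lands in the next level, and the second sub-graph lists tasks 0, 1, and 2 as sources of its excluded input edges.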
Sub-QUBOs are solved one after another, and each previous SG placement is used to enhance the following sub-QUBOs. Our mapper extends the qubit matrix with additional virtual qubits, one for each unique source task of all excluded input edges (edges that are inputs to an SG). Each virtual qubit is associated with a specific PU because the previous task placement is already known at this point. In Figure 1(c), virtual qubits are shown as red crosses inside the sub-QUBO matrices, and the missed edges previously shown as crossed out are illustrated with red arrows.

Our approach guides the solver towards a better solution than is possible with heuristics alone, but it does not guarantee an optimal solution, because the output edges of the sub-graphs are still excluded from the problem and the future placement is not available at this point. It should also be emphasized that QUBO minimizes the sum of the given costs, which are abstract positive numbers. Minimizing the sum does not guarantee that parallel execution time is also minimized if that time is determined by the slowest task.

3.4 Binary solution interpretation

Figure 2 illustrates the binary solution interpretation by mapping the example graph from Figure 1(c) onto the four-unit mesh architecture. Each block corresponds to a dependency level of the task-communication graph. It contains three illustrative components, i.e. a qubit sub-matrix with solution values, the computation task placement corresponding to the solution, and the communication traffic based on prior task placements. If both source and destination tasks are placed on the same unit, the communication edge is marked as local communication. Local communications do not contribute to the data movement component of the objective function and represent the most favourable assignment for communication cost minimization.

3.5 Computation and Communication costs

Computation and communication costs have so far been discussed as abstract positive numbers. However, the nature of the cost metric determines whether the proposed method provides an optimal solution. If the cost is based on delay and the goal of task assignment is to minimize time, QUBO minimization is not guaranteed to provide the optimal placement. This is because QUBO minimizes the sum of the placement costs in each SG, which does not guarantee that the execution time of tasks placed in parallel, set by the slowest of them, is minimal. For additive metrics, such as data movement, power consumption, and energy, the sum minimized by the QUBO is the metric itself, so the proposed method targets the optimal solution.

… quantum circuits. Figure 3(a) shows an example of a quantum circuit. To avoid confusion, the qubits represented in the circuit will be referred to as logical qubits, and the real qubits inside a quantum computer as physical qubits. Four horizontal lines represent the logical qubit state evolution over time (from left to right).


Fig. 3 panels: (a) Quantum Circuit on logical qubits q0 to q3, with single-qubit gates (H, X, Z) and two-qubit gates; (b) Quantum Circuit as TCG, where a single-qubit gate becomes one task x and a two-qubit gate becomes two tasks x.1 and x.2; (c) Quantum Chip Topologies: IBM Vigo (5 qubits, q0 to q4) and IBM QX2 (5 qubits).

Fig. 3: Quantum circuit graph: gate-to-qubit assignment.

Single- and two-qubit gates are applied to specific qubits according to the algorithm's computations. Quantum circuits can be transformed into a task-communication graph, similar to the classical algorithm transformation. In this case, quantum gates represent tasks that have dependencies (black arrows). Figure 3(b) shows the Quantum Circuit Graph (QCG) in the form of a TCG. A two-qubit gate becomes two connected tasks in the QCG. Moreover, two-qubit gates are directional, i.e. there are a source and a destination qubit in the pair.

Topology-aware quantum gate assignment is based on physical qubit connectivity inside the quantum chip. Figure 3(c) shows examples of 5-qubit chip connectivity. Arrows show not only the connection between two physical qubits, but also the supported direction of two-qubit gates. Because of the limited connectivity between qubits, not all two-qubit gates can be applied directly. For example, consider a circuit where a two-qubit gate is applied to logical qubits 0 and 3, and the circuit is mapped onto the architecture in Figure 3(c). There are two ways to map the circuit to the qubits. The first is to map the logical qubits to physical qubits in a different order, such that logical qubits 0 and 3 are mapped to physical qubits 0 and 2. The second, if the logical qubits are already mapped to the architecture in the same order, is to swap the underlying logical qubit states. For instance, if the states of qubits 2 and 3 are swapped, physical qubit 2 now contains the state of logical qubit 3, making it possible to apply the desired two-qubit gate.

4.2 Fidelity and SWAP operation costs

Unlike a classical assignment optimization problem that minimizes computation and communication costs (described in Section 3.5), quantum gate assignment optimization targets different metrics. One of the most important parameters for quantum computations in the NISQ era is fidelity.
Circuit fidelity is a measure of how much quantum information is preserved [23]. Due to noise, the experimentally obtained output qubit state differs from the desired output state that would have been obtained in the ideal scenario. There is a strong correlation between the number of gates and circuit fidelity: more gates generally mean lower fidelity.

Typically, in superconducting technology, single-qubit gates have higher fidelity than two-qubit gates, which require significantly more effort to tune and improve. Each physical qubit is unique in its properties and has a different fidelity per gate. The fidelity resulting from mapping logical qubits and their corresponding gates to the underlying architecture's physical qubits will be referred to as fidelity_mapping. There are several types of two-qubit gates.

A SWAP gate swaps the states of two qubits. A SWAP gate is usually decomposed into a sequence of three CNOT two-qubit gates. CNOT belongs to the so-called native set of gates supported by the control hardware and quantum chip technology. The need for this operation is dictated by the nature of quantum computation: it is not possible to make a copy of a qubit state (no-cloning theorem [28,36]). A SWAP gate is used to move a qubit state to the right location. Thus, the number of SWAP operations N_swaps is analogous to the data movement (communication) cost of the classical TCG. Quantum state movement is required to satisfy chip connectivity, and this movement comes at a cost, because two-qubit gates are the main source of infidelity in quantum circuits. The reduction in fidelity resulting from the insertion of SWAP gates, each having fidelity fidelity_swap, will be referred to as fidelity_movement:

$$\mathit{fidelity}_{\text{movement}} = (\mathit{fidelity}_{\text{swap}})^{N_{\text{swaps}}}, \qquad \mathit{fidelity}_{\text{total}} = \mathit{fidelity}_{\text{mapping}} \cdot \mathit{fidelity}_{\text{movement}} \qquad (2)$$

Since two-qubit gates have lower fidelity, quantum gate assignment optimization can be formulated as N_swaps minimization. However, to obtain the best total fidelity for the quantum circuit, both optimization parameters need to be taken into account, i.e. gate mapping fidelity and the minimum number of SWAPs. This makes the optimization problem almost identical to classical topology-aware task assignment on extremely heterogeneous architectures, where fidelity_mapping represents the computation performance to be maximized and N_swaps represents the communication cost to be minimized. Equation 2 shows how the optimization of these two metrics can be reformulated as maximization of the total fidelity fidelity_total. A large number of recent studies target total circuit fidelity maximization [8]. However, they solve the optimization problem of circuit gate decomposition and assignment by minimizing the number of gates, especially SWAP gates, without considering fidelity_mapping.

Fig. 4 blocks: External Modelling Tools; Problem Input (TCG, ARC); TIGER (QUBO Mapper, QMI Interface, Mapping-to-Metric); Decomposer (sub-*.qubo, .qmasm; size < limit value, size > limit value); Solver (D-Wave, TABU search, qbsolv); Mapping Solutions.
Fig. 4: Topology-aware task assignment using TIGER and quantum annealing.

4.3 Weight Optimization Algorithm

Ising machine weights allow us to vary the priority of one optimization metric relative to another. By scaling the weights associated with SWAP minimization, either qubit fidelity or SWAP reduction can be prioritized. To scale the weights, a priority coefficient pref is introduced. To arrive at optimal solutions, in terms of either the resulting number of inserted SWAP gates or gate fidelity, we propose an optimization algorithm. It searches for the coefficient value that maximizes fidelity_total. Since fidelity_total is obtained from fidelity_mapping and fidelity_movement, the algorithm can also find a solution with maximum fidelity_mapping or minimum qubit movement. Due to the infidelity of SWAP gates, a solution with minimum N_swaps often, though not always, corresponds to the maximum-fidelity_total solution.
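Equation 2 makes the trade-off concrete. The sketch below evaluates it for two hypothetical mappings with made-up fidelity values (not device data): one needs a single SWAP but uses lower-fidelity qubits, the other needs two SWAPs on better qubits. With these numbers the mapping with more SWAPs wins, which is why both terms of Equation 2 must be optimized together.

```python
def fidelity_total(fidelity_mapping, fidelity_swap, n_swaps):
    """Equation 2: fidelity_mapping * fidelity_swap ** n_swaps."""
    fidelity_movement = fidelity_swap ** n_swaps
    return fidelity_mapping * fidelity_movement


# Illustrative numbers only.
fewer_swaps = fidelity_total(fidelity_mapping=0.80, fidelity_swap=0.97, n_swaps=1)
better_qubits = fidelity_total(fidelity_mapping=0.90, fidelity_swap=0.97, n_swaps=2)
print(round(fewer_swaps, 4), round(better_qubits, 4))  # 0.776 0.8468
```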
However, in a hypothetical fully-connected architecture where the qubit movement constraint is eliminated, $fidelity_{mapping}$ would correspond to $fidelity_{total}$. In such a scenario it would be practical to maximize only mapping fidelity. Optimizing only the $fidelity_{mapping}$ or $N_{swaps}$ metric can also give an estimate of the bounds of these metrics in case no optimal solution is known beforehand. Moreover, the proposed optimization algorithm can be suitable, for example, when a specific computation-to-communication ratio must be maintained in task assignment. The pseudocode is given in Algorithm 1. The search starts with an initial preference coefficient, obtains the corresponding metric value, for example $fidelity_{total}$, and compares it to the solutions obtained with a larger and a smaller coefficient. The search space range is defined by the parameter sSpr. How fast the algorithm converges is defined by the parameter sRed, which reduces the search space at every step. For better local search space exploitation, lines 6-17 can be repeated with sSpr = √sSpr.
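The coefficient search of Algorithm 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not TIGER's code: `evaluate` stands in for one full TIGER run (`tiger(QCG, ARC, pref)` in Algorithm 1), `toy_tiger` is a hypothetical evaluator invented here only so the sketch runs, and the initial coefficient, the `s_red` value and the stopping threshold are assumed because those values are not legible in Algorithm 1. The sketch also assumes that the loop re-centers on the best coefficient found so far. The toy evaluator scores each coefficient with Equation 2.

```python
import math
from typing import Callable, Tuple


def fidelity_total(fidelity_mapping: float, fidelity_swap: float, n_swaps: int) -> float:
    """Equation 2: fidelity_total = fidelity_mapping * fidelity_swap ** N_swaps."""
    return fidelity_mapping * fidelity_swap ** n_swaps


def optimize_preference(
    evaluate: Callable[[float], float],
    pref_init: float = 1.0,   # assumed initial coefficient (must be non-zero)
    s_spr: float = 2.0,       # search spread, as in Algorithm 1
    s_red: float = 0.9,       # assumed spread reduction factor (< 1)
    s_min: float = 1.01,      # assumed stopping threshold for the spread
) -> Tuple[float, float]:
    """Shrinking multiplicative search around the best preference coefficient."""
    pref_best = pref_init
    fid_best = evaluate(pref_best)
    while s_spr > s_min:
        # Both probes are derived from the best coefficient at the start of the step.
        for pref in (pref_best / s_spr, pref_best * s_spr):
            fid = evaluate(pref)
            if fid > fid_best:
                fid_best, pref_best = fid, pref
        s_spr *= s_red  # narrow the search space for convergence
    return fid_best, pref_best


def toy_tiger(pref: float, fidelity_swap: float = 0.9) -> float:
    """Hypothetical stand-in for one TIGER run (not TIGER's cost model)."""
    fidelity_mapping = 0.95 - 0.05 / (1.0 + pref)   # improves as pref grows
    n_swaps = round(2.0 * math.log2(1.0 + pref))    # but more SWAPs get inserted
    return fidelity_total(fidelity_mapping, fidelity_swap, n_swaps)


if __name__ == "__main__":
    fid, pref = optimize_preference(toy_tiger)
    print(f"best pref = {pref:.3f}, fidelity_total = {fid:.4f}")
```

With the toy evaluator the search moves from the initial coefficient toward smaller values, where fewer SWAPs outweigh a slightly lower mapping fidelity, mirroring the trade-off in Equation 2. The local refinement mentioned above roughly corresponds to calling `optimize_preference` again from the returned coefficient with `s_spr=math.sqrt(2.0)`.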

Algorithm 1: Preference coefficient optimization
Data: QCG, ARC
Result: fidelity_best, pref_best
  sSpr = 2                  // search spread, sets the search space range
  sRed = 0.…                // spread reduction, reduces sSpr at every step for convergence
  pref_best = 0.…           // initial preference coefficient
  fidelity_best = tiger(QCG, ARC, pref)
  while sSpr > … do
    pref_left = pref / sSpr
    pref_right = pref ∗ sSpr
    fidelity_left = tiger(QCG, ARC, pref_left)
    fidelity_right = tiger(QCG, ARC, pref_right)
    if fidelity_left > fidelity_best then
      fidelity_best = fidelity_left
      pref_best = pref_left
    end
    if fidelity_right > fidelity_best then
      fidelity_best = fidelity_right
      pref_best = pref_right
    end
    sSpr = sSpr ∗ sRed
  end

TIGER is an open-source QUBO mapper written in Python. It uses the NetworkX Python package [7] to create and manipulate TCG/QCG and ARC structures, i.e. computing the computation and communication costs for classical problems and the fidelity and SWAP costs for quantum problems, taking into account the hardware (architecture) topology. We demonstrate TIGER on the D-Wave machine.

TIGER receives two files as inputs (marked as red '1' to denote step 1), namely TCG or QCG and ARC (architecture). The TCG file describes the classical application, the QCG file describes the quantum algorithm, and the ARC file describes the architecture (hardware topology). The format of these files is presented in Figure 5 (a) and (b). The TCG file consists of two types of lines, associated with application tasks and edges. Task lines contain a task ID and multiple cost values, each of a different type, e.g. the number of integer, floating-point, and memory access instructions. Edge lines contain an edge ID, source and destination task IDs, and a cost value, e.g. the amount of data to be transferred between two tasks in bytes.
The architecture file describes the architecture topology and its details, such as the number of rows and columns, the number of PUs, and the capabilities of each PU and link, e.g. cost per type of instruction, link throughput, etc.

Fig. 5: Topology-aware task assignment problem input. (a) Application TCG file format: task lines "taskID [cost1] [cost2] [cost3]" and edge lines "edgeID task_1 task_2 [cost]", with dashed separator lines. (b) Architecture ARC file format: "Parameter IDs Value" lines, e.g. "Topology MESH", "NumRows 2", "NumCols 2", "NumPUs 4", "PU 0..3 1, 2, 2, 4", "Link 0..3 2, 2, 2, 2", …

Table 1: Benchmark suite

Workload | Problem size | Tasks | Edges
Ultrasound | 9x5x10 | 15 | 15
RS-encoder | 32x28x8 | 141 | 140
RS-decoder | 32x28x8 | 526 | 789

Using the algorithm described in Section 3, TIGER maps the input TCG and ARC files into the QUBO format and generates the QMI interface file (step '2'). It supports both the qmasm and qubo formats and can generate a single file per problem or multiple files if the QUBO partitioning option is chosen. If the size of the problem is less than the physical limit value, i.e. the qubit sub-matrix size, the QUBO or sub-QUBO can be solved directly (step '3'). Otherwise, it has to be further decomposed by qbsolv and then solved (step '4'). In both cases the problem is solved by one of two available solvers: the D-Wave annealer or a TABU search qbsolv implementation (step '5'). Finally, the solver generates mapping solutions that are sent back to the TIGER tool. If the solution corresponds to a sub-QUBO (step '7'), TIGER uses it to generate the next sub-QUBO as described in Section 3.2. If the solution is complete (step '6') or the last sub-QUBO problem is solved, TIGER calculates the final cost of the assignment through its Mapping-to-Metric (MtoM) interface (step '8'). This cost can be used to estimate the quality of the solution.
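The TCG format of Fig. 5 (a) and the final cost evaluation of step '8' can be illustrated with the minimal sketch below. It is not TIGER's MtoM interface: `parse_tcg` and `mapping_cost` are hypothetical names, the parser assumes the task section is followed by a dashed separator line and then the edge section, and the cost model is a simplification assumed here for illustration (computation = sum of a task's cost values times a per-PU cost factor; communication = edge cost × Manhattan hop distance on a row-major mesh × one link cost). TIGER's actual cost model may differ. Reading the "PU 0..3 1, 2, 2, 4" line of Fig. 5 (b) as per-PU cost factors is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping, Sequence, Tuple


@dataclass
class TCG:
    tasks: Dict[int, List[float]] = field(default_factory=dict)             # task ID -> cost values
    edges: Dict[int, Tuple[int, int, float]] = field(default_factory=dict)  # edge ID -> (src, dst, cost)


def parse_tcg(text: str) -> TCG:
    """Parse task lines, a dashed separator, then edge lines (assumed layout)."""
    tcg = TCG()
    in_edges = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"-"}:          # dashed separator line
            in_edges = bool(tcg.tasks)  # switch to edges once tasks have been read
            continue
        fields = line.split()
        if not in_edges:
            tcg.tasks[int(fields[0])] = [float(c) for c in fields[1:]]
        else:
            edge_id, src, dst, cost = fields
            tcg.edges[int(edge_id)] = (int(src), int(dst), float(cost))
    return tcg


def mapping_cost(tcg: TCG, assignment: Mapping[int, int], num_cols: int,
                 pu_factor: Sequence[float], link_cost: float) -> Tuple[float, float, float]:
    """Return (computation, communication, total) cost of a task -> PU assignment."""
    computation = sum(sum(costs) * pu_factor[assignment[t]] for t, costs in tcg.tasks.items())
    communication = 0.0
    for src, dst, cost in tcg.edges.values():
        a, b = assignment[src], assignment[dst]
        # Manhattan distance between PUs laid out row-major on the mesh.
        hops = abs(a // num_cols - b // num_cols) + abs(a % num_cols - b % num_cols)
        communication += cost * hops * link_cost
    return computation, communication, computation + communication


TCG_TEXT = """
0 10 0 5
1 20 5 0
2 5 5 5
--------------------
0 0 1 4
1 1 2 2
"""

if __name__ == "__main__":
    tcg = parse_tcg(TCG_TEXT)
    # 2 x 2 mesh, per-PU factors and link cost 2 as in the Fig. 5 (b) example.
    print(mapping_cost(tcg, {0: 0, 1: 1, 2: 3}, num_cols=2, pu_factor=[1, 2, 2, 4], link_cost=2))
```

Separating the parsed graph from the cost function mirrors the split between the mapper input (step '1') and the metric computed after solving (step '8'): the same solution can be re-scored under a different cost model without re-parsing.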

Fig. 6: Delay-to-Solution evaluation: (I) classical TABU-search solver w/o TIGER sQ partition, (II) quantum DW solver w/o TIGER sQ partition, and (III) quantum DW solver with TIGER sQ partition.

… the COSMIC benchmark suite [33]. Table 1 shows the set of chosen benchmarks and their characteristics.

For quantum QCG assignment optimization, we create QCG files formatted for TIGER from the quantum benchmark suite [39]. We create ARC files based on two IBM quantum chips [15]:
– IBM Yorktown (QX2) with 5 qubits and
– IBM Vigo with 5 qubits.

Figure 3 (c) illustrates these two topologies. The quantum benchmark suite [39] provides 48 circuits for 5-qubit chips. We reduce the circuit size down to 50 gates.

6.2 Tool flow evaluation

For each workload we evaluate three scenarios: (I) TIGER QUBO mapper → qbsolv decomposer/TABU-search qbsolv solver → TIGER MtoM interpreter, (II) TIGER QUBO mapper → qbsolv decomposer/DW solver → TIGER MtoM interpreter, and (III) TIGER QUBO mapper/TIGER SG partitioner → qbsolv decomposer/TABU-search qbsolv solver → TIGER MtoM interpreter. For each scenario, we vary the size of the architecture to a 2 × … . In addition, we show the number of logical qubits and couplers generated by TIGER's mapper (qubits and couplers), the number of partitions provided by qbsolv's decomposer (partitions), and the number of SGs generated by TIGER's partitioner (tiger sQs). The number of qubits in scenarios I and II is equal, but it is higher in scenario III because additional qubits are required to define previous sub-QUBO placements, as shown in Figure 1. Similarly, the number of couplers and the number of partitions are equal in scenarios I and II; both are lower in scenario III due to the optimized QUBO mapping. The number of TIGER sub-QUBOs is reported only for scenario III. In scenarios I and II this TIGER option is not applied (na).

Discussion:

Performance evaluation results show that the physical quantum annealer, i.e. DW2X, can significantly reduce delay-to-solution compared to the classical qbsolv solver. For the given set of benchmarks and architecture configurations, the speedup of the DECOMPOSER-SOLVER phase varies between 1.2× and 10.2×. Most of this improvement comes from replacing the classical solver with the quantum annealer. The average DW2X access time is around 20 ms. This time includes programming time, sampling time and post-processing time. The sampling phase consists of multiple sample batches, each of which includes annealing, readout, and an additional delay that allows the quantum annealer to cool down to the initial state. The annealing time is 20 µs. Although the QUBO is solved by a physical quantum annealer, a significant amount of time associated with problem decomposition is spent by the qbsolv DECOMPOSER. The total D-Wave SOLVER phase is composed of multiple D-Wave accesses, where the number of accesses is determined by the number of partition calls provided by the qbsolv DECOMPOSER. Therefore, when using the quantum annealing solver, the delay-to-solution depends strongly on the quality of the classical decomposition.

In scenario III, we evaluate the impact of the domain-specific partitioning integrated into the QUBO mapper, i.e. the TIGER-level partitioner. Here, reported values represent the sum over all sub-QUBOs of the total number of qubits and couplers, as well as of the delays per phase. Results show that by applying two-level QUBO partitioning (i.e. domain-specific first and classical qbsolv second), a massive speedup in time-to-solution can be achieved. For the given set of TCGs and ARCs, the DECOMPOSER-SOLVER phase is reduced down to 6% compared to the baseline scenario.
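Why the SOLVER phase depends on the decomposition can be made concrete with a back-of-the-envelope model. This is a hedged sketch with hypothetical function names, not D-Wave's timing API: it assumes one QPU access costs programming time, plus one (anneal + readout + delay) cycle per sample, plus post-processing, and that the SOLVER phase is simply the number of qbsolv partition calls times the per-access time. Only the 20 µs anneal time and the roughly 20 ms average access time come from the text; the other timing values in the example are placeholders.

```python
def qpu_access_ms(programming_ms: float, num_samples: int, anneal_us: float,
                  readout_us: float, delay_us: float, postprocessing_ms: float) -> float:
    """One QPU access: programming + per-sample (anneal + readout + delay) + post-processing."""
    sampling_ms = num_samples * (anneal_us + readout_us + delay_us) / 1000.0
    return programming_ms + sampling_ms + postprocessing_ms


def solver_phase_ms(num_partition_calls: int, access_ms: float = 20.0) -> float:
    """SOLVER phase modeled as one QPU access per qbsolv partition call (~20 ms each)."""
    return num_partition_calls * access_ms


if __name__ == "__main__":
    # Placeholder timings, except the 20 us anneal time quoted in the text.
    print(qpu_access_ms(10.0, 100, anneal_us=20.0, readout_us=100.0,
                        delay_us=20.0, postprocessing_ms=1.0))
    # Halving the number of partition calls halves the modeled SOLVER phase.
    print(solver_phase_ms(40), solver_phase_ms(20))
```

The model deliberately ignores the DECOMPOSER time itself, which the text notes is significant; it only captures why fewer partition calls shorten the SOLVER phase.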

Fig. 7: Task assignment sensitivity and quality of the solution for (a) Ultrasound-9x5x10, (b) Reed-Solomon Encoder-32x28x8 and (c) Reed-Solomon Decoder-32x28x8. (DW, single): DW w/o sQ vs. classical TABU-search w/o sQ; (qbsolv, sQ): classical TABU-search with sQ vs. classical TABU-search w/o sQ; (DW, sQ): DW with sQ vs. classical TABU-search w/o sQ.

Such an improvement has several sources. First, the TIGER partition significantly simplifies the task for the qbsolv DECOMPOSER, which performs better on a smaller subset of qubits and couplers than on a single large problem. Consequently, qbsolv generates fewer partition calls, thereby reducing the D-Wave SOLVER phase delay. This effect is particularly noticeable for larger TCGs, where the number of partitions is reduced by half. The total number of qubits and couplers also differs from the baseline. By using the minimum possible number of qubits and adjusting the level of granularity (i.e. one sub-level per sub-QUBO), we reduce the total number of couplers. These improvements come at the expense of a larger number of qubits: the increase is 12% on average compared to the baseline. On the other hand, additional partitioning can potentially impact the quality of the generated solution. This effect is evaluated in the following section.

6.3 Task assignment evaluation

We evaluate the assignment quality and multiple-run sensitivity in three comparison scenarios: (i) single QUBO on the quantum annealer versus the classical qbsolv solver (dw, single), (ii) partitioned sub-QUBOs versus single QUBO assignment on the classical qbsolv solver (qbsolv, sQ), and (iii) partitioned sub-QUBOs on the quantum annealer versus single QUBO assignment on the classical qbsolv solver (DW, sQ). Architecture configuration files represent a 2 × 2, 4 × 4, or 8 × … to 4 × … . Link cost is equal to 2. Figure 7 shows the difference in computation, communication and total costs for the three evaluation scenarios compared to the baseline.

Discussion:

In some cases, we obtain the same solution over multiple runs. If different solutions are returned, the variation is usually within 5% of the mean value. For the given set of experiments, the DW2X quantum solver provides solution improvements for a single QUBO compared to the classical TABU-search solver. Results show up to 8% computation cost improvement, up to 25% communication cost improvement, and up to 15% total improvement. Both the qbsolv sQ and DW sQ scenarios show similar behaviour in most experiments. However, again the DW2X quantum solver provides better solutions, e.g. RS-Encoder mapped on 2 × … for US TCG mapped on 2 × 2, 4 × … RS Encoder TCG as shown in Figure 7(b). However, the computation constituent does not deteriorate. In both TCGs, task computation costs far outweigh communication edge costs. For instance, the US computation cost ranges between 4,510 and 3,461,112, while the highest communication cost is 20, 60 and 140 for 2 × 2, 4 × 4, and 8 × … RS Decoder TCG is communication intensive. The computation cost varies up to 1,880, while the communication cost reaches 14,280 for 8 × …

… $N_{swaps}$ and $fidelity_{mapping}$, a set of experiments was performed on the QCGs mentioned in Section 6.1.2. The preference coefficient varies from 0.01 to 30. Figure 8 shows the mapping fidelity ($fidelity_{mapping}$), number of swaps ($N_{swaps}$) and total fidelity for different coefficient values. Smaller coefficients minimize qubit state movement, while larger ones prioritize mapping fidelity instead. The black box marks a near-optimum region of the priority coefficient. Using a priority coefficient smaller than 0.05 results in invalid solutions being produced by the algorithm and can even have the opposite effect, increasing $N_{swaps}$ instead. Setting the coefficient larger than 20 provides only a small improvement of $fidelity_{mapping}$; moreover, this happens only in some architectures and incurs an excessive number of additional SWAPs. Hence, the applicable coefficient values that produce the minimum $N_{swaps}$ and the maximum $fidelity_{mapping}$ are approximately 0.05 and 20, respectively. Total fidelity strongly correlates with the number of SWAPs, and mapping fidelity plays a negligible role in this scenario.

Fig. 8: IBM Vigo: mapping fidelity, number of swaps and total fidelity.

Discussion:

Since $fidelity_{movement}$, which is determined by $N_{swaps}$, has a larger impact on $fidelity_{total}$, $N_{swaps}$ is usually minimized and gate fidelity is not considered. This means that the priority coefficient that maximizes $fidelity_{total}$ is the same one that minimizes $N_{swaps}$, i.e. 0.05. However, as connectivity in quantum computing architectures increases, qubit movement might become less significant. In such a context, maximization of $fidelity_{total}$ would depend entirely on $fidelity_{mapping}$. To tackle any possible scenario, $fidelity_{total}$ can be maximized regardless of connectivity and gate fidelity. The priority coefficient that allows such maximization is unknown and can vary for every circuit and architecture. We study the proposed weight optimization algorithm (WOA) to assess its efficiency in finding the optimal priority coefficient for a given combination of quantum circuit and device topology.

Figure 9 shows total fidelity and number of SWAPs optimized by the WOA for multiple circuits on the IBM Vigo and IBM QX2 topologies. The results include the initial value at the beginning of the algorithm execution and the final value. For the IBM Vigo topology (Figure 9 (a) and (b)), the WOA finds a priority coefficient that reduces the number of SWAPs from the initial step value in 62.5% of cases. In 37.5% of cases the number of SWAPs remains unchanged. The results with strong reduction are highlighted in green. On average, the WOA improves total fidelity by 39% for the IBM Vigo topology. For the IBM QX2 topology (Figure 9 (c) and (d)), the WOA finds a priority coefficient that reduces the number of SWAPs from the initial step value in 83.3% of cases. In one case the number of SWAPs remains unchanged, and in 14.6% of cases the WOA yields a slight increase in the number of SWAPs. On average, the WOA improves total fidelity by 107% for the IBM QX2 topology.

Fig. 9: Quantum gate assignment: weight optimization algorithm search. (a) Vigo: Fidelity, (b) Vigo: Number of SWAPs, (c) QX2: Fidelity, (d) QX2: Number of SWAPs.

Discussion:

The results show a significant difference in WOA performance when applied to different topologies. While in general the WOA allowed us to find a more suitable combination of QUBO weights (preference coefficient) for both topologies, the IBM QX2 mapping is much more sensitive to the choice of the priority coefficient. Moreover, in a few cases the WOA missed the optimal solution, which resulted in a slight increase in the number of SWAPs compared to the initial state value. We believe the reason lies in the complexity of the topology graph, which calls for adjusting the QUBO weights to find the most suitable combination in a near-optimum region.

Finally, we compare the performance of the TIGER topology-aware SWAP optimizer against the IBM QX optimizer. Figure 10 shows the comparison results across multiple circuits for two topologies, i.e. vigo and qx2. The numbers show the final number of SWAPs. The SWAP reduction color map highlights the cases where one of the optimizers provides a better result, with the following SWAP number differences: (i) 1-2 SWAPs, (ii) 3-4 SWAPs, (iii) 5-7 SWAPs or (iv) more than 7. For the vigo topology, TIGER and IBM QX provide the same SWAP number in 18.7% of cases; IBM QX outperforms TIGER in 41.7% of cases with a total difference of 51 SWAPs; and TIGER outperforms IBM QX in 39.6% of cases with a total difference of 59 SWAPs. For the qx2 topology, TIGER and IBM QX provide the same SWAP number in only 4.2% of cases; IBM QX outperforms TIGER in 8.3% of cases with a total difference of 12 SWAPs; and TIGER significantly outperforms IBM QX in 87.5% of cases with a total difference of 260 SWAPs. Moreover, TIGER found the perfect mapping, reducing data movement to 0 SWAPs, in 16.7% of cases, while IBM QX found the perfect mapping in only 4.2% of cases.
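The vigo shares and SWAP totals quoted above can be reproduced from the per-circuit SWAP counts of Fig. 10. The sketch below transcribes the two vigo rows in the figure's column order (circuit names omitted) and counts ties, wins, and the summed SWAP difference in each winner's favor; the helper name `compare_swaps` is ours, introduced only for this illustration.

```python
from typing import Dict, List

# Final SWAP counts per circuit on the vigo topology, transcribed from Fig. 10 (48 circuits).
TIGER_VIGO = [16, 18, 8, 6, 4, 14, 14, 17, 18, 18, 20, 13, 18, 11, 5, 4,
              15, 10, 16, 16, 18, 17, 11, 13, 12, 18, 14, 13, 19, 11, 16, 11,
              15, 15, 17, 19, 14, 15, 17, 7, 17, 10, 18, 18, 18, 18, 19, 16]
IBMQX_VIGO = [16, 16, 12, 9, 6, 21, 21, 17, 15, 15, 19, 21, 16, 12, 5, 8,
              15, 12, 15, 17, 21, 18, 12, 12, 12, 15, 17, 9, 16, 12, 16, 12,
              14, 14, 14, 13, 18, 20, 13, 7, 17, 10, 16, 13, 16, 15, 18, 17]


def compare_swaps(ours: List[int], theirs: List[int]) -> Dict[str, int]:
    """Count ties and wins, and sum the SWAP difference in each winner's favor."""
    if len(ours) != len(theirs):
        raise ValueError("both optimizers must report the same circuits")
    stats = {"same": 0, "ours_better": 0, "ours_saved": 0,
             "theirs_better": 0, "theirs_saved": 0}
    for a, b in zip(ours, theirs):
        if a == b:
            stats["same"] += 1
        elif a < b:                      # fewer SWAPs is better
            stats["ours_better"] += 1
            stats["ours_saved"] += b - a
        else:
            stats["theirs_better"] += 1
            stats["theirs_saved"] += a - b
    return stats


if __name__ == "__main__":
    stats = compare_swaps(TIGER_VIGO, IBMQX_VIGO)
    n = len(TIGER_VIGO)
    for key in ("same", "ours_better", "theirs_better"):
        print(f"{key}: {stats[key]} circuits ({100 * stats[key] / n:.1f}%)")
    print(stats)
```

On these rows there are 9 ties (18.75% of 48 circuits), 19 TIGER wins totalling 59 SWAPs (39.6%), and 20 IBM QX wins totalling 51 SWAPs (41.7%), consistent with the vigo figures quoted in the text.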

Discussion:

Similar to the WOA evaluation results (see Section 6.4.1), the comparison results show a significant difference when applied to different topologies. TIGER allowed us to significantly improve the mapping for the IBM QX2 topology compared to the IBM QX optimizer. We believe the reason also lies in the complexity of the topology graph. The classical IBM QX optimizer is not suitable for more complex topologies with a larger number of potential combinations, while the TIGER optimizer allows us to find the 'perfect' mapping regardless of topology complexity.

Fig. 10: Optimizer comparison: TIGER vs. IBM QX. [Figure: final number of SWAPs per benchmark circuit for the vigo and qx2 topologies, TIGER vs. IBM QX; color map of the SWAP reduction difference: 1-2, 3-4, 5-7, >7.]

In this paper, we propose an algorithm for solving the topology-aware task/gate assignment problem on physical Ising machines in order to accelerate the solution of this challenging NP-complete problem and improve its quality. We implement our solution in our TIGER tool, which transforms weighted task-communication, quantum circuit, and architecture graphs into an appropriate Hamiltonian function format. Our solution takes into account both computation and communication costs for the classical problem, or fidelity and SWAP number for the quantum problem. We evaluate the proposed approach using D-Wave's quantum annealer. To overcome the physical limitations of current quantum annealers, we propose domain-specific partitioning based on the task-communication graph dependency levels. We also propose a weight optimization algorithm that enables adjusting the model parameters to find better solutions. We integrate TIGER into the D-Wave software stack, which enables us to apply both our proposed dependency-level partitioning and the partitioning provided by the qbsolv tool in a dynamic, iterative way. We demonstrate that our method can reach 15% higher-quality solutions 9% faster compared to the classical qbsolv heuristic algorithm. Finally, TIGER reduces the data movement cost by 68% on average for quantum circuit assignment compared to the IBM QX optimizer [15]. Our work alleviates the concern that task mapping may hinder high-quality solutions on future quantum accelerators with more physical qubits and complex connectivity. The TIGER tool is publicly available online at https://github.com/lbnlcomputerarch/tiger.

For future work, we consider three major directions:
– Comparison to a wide range of classical scheduling tools: we plan to design a methodology to compare the hardware optimizer, i.e. the Ising machine, to existing heuristic software tools.
– Use other Ising machines: we plan to expand our study by running the problem on other Ising machines, such as the digital annealer [9] and the coherent Ising machine [37].
– Problem partitioning algorithms and additional constraints mapping: we plan to evaluate additional graph partitioning algorithms and alternative problem mapping algorithms, e.g. assigning multiple tasks to one node based on the node capacity.

Acknowledgements

The research leading to these results has received funding from the U.S. Department of Energy, grant agreement no. DE-AC02-05CH11231.

References

1. Bokhari, S.H.: On the mapping problem. IEEE Transactions on Computers C-30(3), 207–214 (1981). DOI 10.1109/TC.1981.1675756
2. Booth, M., Reinhardt, S.P., Roy, A.: Partitioning optimization problems for hybrid classical/quantum execution. Tech. rep. (2017)
3. Burkard, R., Dell'Amico, M., Martello, S.: Assignment Problems. Society for Industrial and Applied Mathematics, PA, USA (2009)
4. Chan, C.P., Bachan, J.D., Kenny, J.P., Wilke, J.J., Beckner, V.E., Almgren, A.S., Bell, J.B.: Topology-aware performance optimization and modeling of adaptive mesh refinement codes for exascale. In: 2016 First International Workshop on Communication Optimizations in HPC (COMHPC), pp. 17–28 (2016)
5. Daskalakis, C., Dikkala, N., Kamath, G.: Testing Ising models. IEEE Transactions on Information Theory pp. 1–1 (2019). DOI 10.1109/TIT.2019.2932255
6. Denchev, V.S., Boixo, S., Isakov, S.V., Ding, N., Babbush, R., Smelyanskiy, V., Martinis, J., Neven, H.: What is the computational value of finite-range tunneling? Phys. Rev. X, 031015 (2016). URL https://link.aps.org/doi/10.1103/PhysRevX.6.031015
… (4), 045003 (2018). DOI 10.1088/2058-9565/aacf0b
11. Glover, F., Laguna, M.: Tabu Search. Kluwer Academic Publishers, Norwell, MA, USA (1997)
12. Glover, F.W., Kochenberger, G.A.: A tutorial on formulating QUBO models. ArXiv abs/1811.11538 (2018)
13. Hoefler, T., Snir, M.: Generic topology mapping strategies for large-scale parallel architectures. In: Proceedings of the International Conference on Supercomputing, ICS '11, pp. 75–84. ACM, New York, NY, USA (2011). URL http://doi.acm.org/10.1145/1995896.1995909

14. Hwang, F.K.: The Hamiltonian property of linear functions. Oper. Res. Lett. (3), 125–127 (1987). DOI 10.1016/0167-6377(87)90024-1. URL http://dx.doi.org/10.1016/0167-6377(87)90024-1
… C-36(4), 433–442 (1987)
21. Li, G., Ding, Y., Xie, Y.: Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices. arXiv e-prints arXiv:1809.02573 (2018)
22. Mandrà, S., Zhu, Z., Wang, W., Perdomo-Ortiz, A., Katzgraber, H.G.: Strengths and weaknesses of weak-strong cluster problems: A detailed overview of state-of-the-art classical heuristics versus quantum approaches. ArXiv e-prints (2), 022337 (2016)
23. Markov, I.L., Fatima, A., Isakov, S.V., Boixo, S.: Quantum Supremacy Is Both Closer and Farther than It Appears. arXiv e-prints arXiv:1807.10749 (2018)
24. Nielsen, M.A., Chuang, I.L.: Quantum Computation and Quantum Information: 10th Anniversary Edition, 10th edn. Cambridge University Press, New York, NY, USA (2011)
25. Orduna, J.M., Silla, F., Duato, J.: A new task mapping technique for communication-aware scheduling strategies. In: Proceedings International Conference on Parallel Processing Workshops, pp. 349–354 (2001)
26. Pakin, S.: A quantum macro assembler. In: 2016 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–8 (2016)
27. Pakin, S., Reinhardt, S.P.: A survey of programming tools for D-Wave quantum-annealing processors. In: R. Yokota, M. Weiland, D. Keyes, C. Trinitis (eds.) High Performance Computing, pp. 103–122. Springer International Publishing, Cham (2018)
28. Park, J.L.: The concept of transition in quantum mechanics. Foundations of Physics (1), 23–33 (1970). DOI 10.1007/BF00708652. URL https://doi.org/10.1007/BF00708652

29. Salimi, R., Motameni, H., Omranpour, H.: Task scheduling with load balancing for computational grid using NSGA II with fuzzy mutation. In: 2012 2nd IEEE International Conference on Parallel, Distributed and Grid Computing (2012)
30. Schaeffer, S.E.: Survey: Graph clustering. Comput. Sci. Rev. (2007)
31. Taura, K., Chien, A.: A heuristic algorithm for mapping communicating tasks on heterogeneous resources. In: Proceedings 9th Heterogeneous Computing Workshop (HCW 2000) (Cat. No.PR00556), pp. 102–115 (2000)
32. Tran, T.T., Do, M., Rieffel, E.G., Frank, J., Wang, Z., O'Gorman, B., Venturelli, D., Beck, J.C.: A hybrid quantum-classical approach to solving scheduling problems. In: Ninth Annual Symposium on Combinatorial Search (2016)
33. Wang, Z., Liu, W., Xu, J., Li, B., Iyer, R., Illikkal, R., Wu, X., Mow, W.H., Ye, W.: A case study on the communication and computation behaviors of real applications in NoC-based MPSoCs. In: 2014 IEEE Computer Society Annual Symposium on VLSI, pp. 480–485 (2014)
34. Wayne Bollinger, S., Midkiff, S.: Processor and link assignment in multicomputers using simulated annealing. In: ICPP, vol. 1, pp. 1–7 (1988)
35. Wille, R., Burgholzer, L., Zulehner, A.: Mapping Quantum Circuits to IBM QX Architectures Using the Minimal Number of SWAP and H Operations. arXiv e-prints arXiv:1907.02026 (2019)
36. Wootters, W.K., Zurek, W.H.: A single quantum cannot be cloned. Nature (5886), 802–803 (1982). DOI 10.1038/299802a0. URL http://dx.doi.org/10.1038/299802a0

37. Yamamoto, Y., Aihara, K., Leleu, T., Kawarabayashi, K.i., Kako, S., Fejer, M., Inoue, K., Takesue, H.: Coherent Ising machines—optical neural networks operating at the quantum limit. npj Quantum Information (1), 49 (2017). DOI 10.1038/s41534-017-0048-9. URL https://doi.org/10.1038/s41534-017-0048-9

38. Zick, K.M., Shehab, O., French, M.: Experimental quantum annealing: case study involving the graph isomorphism problem. Scientific Reports, 11168 EP (2015). URL http://dx.doi.org/10.1038/srep11168