TIGER: Topology-aware Assignment using Ising machines Application to Classical Algorithm Tasks and Quantum Circuit Gates
Anastasiia Butko, Ilyas Turimbetov, George Michelogiannakis, David Donofrio, Didem Unat, John Shalf
September 20, 2020
Abstract
Optimally mapping a parallel application to compute and communication resources is increasingly important as both system size and heterogeneity increase. A similar mapping problem exists in gate-based quantum computing, where the objective is to map tasks to gates in a topology-aware fashion. This is an NP-complete graph isomorphism problem, and existing task assignment approaches are either heuristic or based on physical optimization algorithms, providing different speed and solution quality trade-offs. Ising machines such as quantum and digital annealers have recently become available and offer an alternative hardware solution to this type of optimization problem. In this paper, we propose an algorithm that allows solving the topology-aware assignment problem using Ising machines. We demonstrate the algorithm on two use cases, i.e. classical task scheduling and quantum circuit gate scheduling. TIGER, a topology-aware task/gate assignment mapper tool, implements our proposed algorithms and automatically integrates them into the quantum software environment. To address the limitations of the physical solver, we propose and implement a domain-specific partition strategy that allows solving larger-scale problems and a weight optimization algorithm that allows tuning Ising model parameters to achieve better results. We use D-Wave’s quantum annealer to demonstrate our algorithm and evaluate the proposed tool flow in terms of performance, partition efficiency, and solution quality. Results show significant speed-up compared to classical solutions, better scalability, and higher solution quality when using TIGER together with the proposed partition method. It reduces the data movement cost by 68% on average for quantum circuit assignment compared to the IBM QX optimizer [15].

A. Butko · G. Michelogiannakis · D. Donofrio · J. Shalf
Lawrence Berkeley National Laboratory, Berkeley CA 94720, USA
E-mail: {abutko,mihelog,ddonofrio,jshalf}@lbl.gov
I. Turimbetov · D. Unat
Koç University, Istanbul 34450, Turkey
E-mail: {iturimbetov18,dunat}@ku.edu.tr

Keywords
Topology-aware task assignment · gate scheduling optimization · Ising machine · quantum annealing

arXiv [cs.ET]

The task assignment problem aims to maximize application performance by balancing computational load among multiple and often heterogeneous processing units while reducing compute overhead. The task assignment problem has been shown to be equivalent to a graph isomorphism problem by Bokhari [1], which is known to be NP-complete [20,13]. Therefore, many solvers for this problem are either heuristics [31], which inevitably trade off solution quality for computation speed, or physical optimization algorithms, such as simulated annealing [34], genetic techniques [25], and others. In addition, solvers can have different optimization metrics that are often contradictory, such as computational load, communication cost, or a weighted combination [29,4].

Scheduling quantum gates onto physical qubits is similarly a challenging problem, given the complexity and variety of quantum operations and the physical restrictions of each quantum chip. To keep operations efficient, quantum gates should be scheduled on quantum hardware so as to minimize the number of operations and maximize quantum circuit fidelity (how much quantum information is preserved), while taking into account the connectivity between physical qubits [10]. Consequently, many mapping algorithms scale poorly in terms of runtime, memory usage, and the quality of their generated solutions [21]. In addition, the quality of their solutions compared to the theoretical optimum is unknown [35]. These challenges indicate that gate assignment may hinder high-quality solutions on future quantum accelerators with more physical qubits and complex connectivity.

While genetic algorithms and simulated annealing are often considered best practices, recent Ising machines offer an alternative hardware solution for a set of optimization problems, such as task scheduling.
These Ising machines can be implemented using different technologies and exploit various physical effects. Examples include coherent Ising machines [37], Fujitsu’s digital annealer [9], and quantum annealers designed by D-Wave Systems Inc. [16]. Several studies on quantum annealers [22] [19] explore their capabilities and limitations, projecting the potential of these machines for future use.

Despite the potential benefits offered by quantum annealers combined with a growing interest in alternative solutions, the practical applicability of annealing machines remains highly questionable. One of the reasons is the physical limitations of current machines, namely the relatively small size of the chip and the poor connectivity between qubits [19]. Problem sizes demonstrated in comparison studies are usually not competitive with those handled by classical solvers. Therefore, effective problem partitioning and post-processing are required to continue exploiting quantum solver capabilities while a solution to the physical limitations is sought [38]. That makes most of the near-term quantum annealing-based approaches classical-quantum hybrids.

Another obstacle to widespread quantum annealer adoption is programming complexity. The programming model is based on the Quadratic Unconstrained Binary Optimization (QUBO) [12] model, which differs from conventional programming and requires special approaches. The highest level at which users are required to program D-Wave is “virtual” QUBO, where “virtual” means that the compiler takes care of mapping and routing the problem while taking into account device connectivity. Transforming a problem into QUBO format is not a trivial task. Higher-level tools as well as efficient algorithms are typically required [27].

In this work, we present the Topology-aware task assIGnment mappER (TIGER) to solve the assignment problem using Ising machines.
Namely, our contributions are:
– We develop an algorithm to assign a Task-Communication Graph (TCG) to the architecture units, minimizing the required data movement and maximizing the performance. The assignment problem is expressed in the QUBO format to be used by an Ising machine.
– We develop an algorithm to assign a Quantum Circuit Graph (QCG) to the qubits, minimizing data movement (number of SWAP operations) and maximizing the fidelity. The assignment problem is expressed in the QUBO format to be used by an Ising machine.
– We develop a domain-specific QUBO partitioning algorithm (sub-QUBO) based on the graph dependency levels to overcome current physical limitations of existing quantum annealers and accelerate the solution search.
– We develop a weight optimization algorithm (WOA) to tune Ising equation parameters in order to prioritize target metrics and adjust them to obtain better solutions.
– We implement these algorithms as the TIGER tool. TIGER is written in Python and uses the NetworkX package [7] to create and manipulate TCG/QCG and ARC structures.
– We integrate TIGER into the D-Wave tool-flow by supporting the qbsolv qubo [2] and qmasm [26] formats and creating a feedback loop from D-Wave to TIGER in order to evaluate the solution for further optimizations.
– We evaluate the proposed algorithms and their implementation using the D-Wave quantum annealer. We compare the D-Wave solver performance and the quality of the task assignment (solution) to the classical TABU-search algorithm. We evaluate the quality of the quantum circuit assignment in terms of circuit fidelity using real IBM systems [15] and compare it against the IBM QX gate optimizer. Our results show that TIGER with the D-Wave annealer provides up to 8% computation cost improvement and up to 25% communication cost improvement compared to the classical TABU-search solver when assigning a TCG. It reduces the data movement cost by 68% on average for quantum circuit assignment compared to the IBM QX optimizer [15].

Given the relatively small size of the evaluated quantum annealer, we leave the discussion on the general competitiveness of quantum annealers against classical computing out of the scope of this paper. Our results aim to provide useful insights on the entire tool-flow, including classical decomposition, domain-specific partition and QUBO solvers. Last but not least, we would like to extend an invitation to the community to use TIGER and then contribute back to aid tool growth.
Latest updates, documentation, and support can be found online at https://github.com/lbnlcomputerarch/tiger.

The rest of the paper is organized as follows: Section 2 provides the background on the existing Ising machines. Section 3 and Section 4 describe the proposed task assignment and quantum gate assignment mapping approaches, respectively. Section 5 describes the TIGER tool implementation as well as its integration into the complete tool-flow with the D-Wave programming environment. Section 6 shows performance, quality, sensitivity and scalability evaluation results. Section 7 concludes the work.

Ising machines are special-purpose processors that solve the Ising model, an intensely studied NP-complete problem that describes a system of interacting classical spins [5]. An Ising model is a mathematical model composed of a large lattice of sites, where each site can be in one of two states. This model can be used to model the impact on the global state of the system caused by changes to parameters (such as connectivity and desired operations). Ising models have been used to express and perform computation with different materials such as lasers and magnets, but are also the basis of several quantum accelerators because they are a natural fit to express a graph of interconnected qubits.

Quantum annealing [18] is a metaheuristic technique for solving local search problems, such as finding the global minimum or maximum in a discrete search space. Quantum annealing offers potential benefits compared to popular heuristic algorithms through its quantum tunneling effect. This effect allows the system to penetrate energy barriers, escaping from local minima and therefore finding better solutions to the original optimization problem.

A quantum annealing machine, or quantum annealer, is a hardware implementation of the adiabatic quantum computing algorithm. Quantum annealers operate on a set of qubits. A qubit is a two-state quantum-mechanical system that can be in one of the basis states $|0\rangle$ and $|1\rangle$ or in a linear superposition of these basis states. This feature forms the key power of quantum machines, which with $n$ qubits can be in an arbitrary superposition of up to $2^n$ different states simultaneously. Another inherent quantum property of qubits is quantum entanglement, where a group of qubits is coupled to each other in such a way that the state of each qubit cannot be perceived separately, but only as a whole system state [24].

Quantum annealers provided by D-Wave Systems Inc. have been commercially available since 2011 [16]. D-Wave quantum chips are implemented using superconducting technology and require an extremely isolated environment with a temperature close to absolute zero. A closed-cycle dilution refrigerator cools the processor down to 15 mK. Therefore, while the actual quantum chip is the size of a stamp, the physical volume of the whole D-Wave system reaches 20 m$^3$. However, D-Wave machines consume less than 25 kW of power, mostly for cooling and front-end servers [17]. In around 10 years, quantum annealing chips have reached on the order of $10^3$ qubits, promising significant performance improvement for certain computing problems in the near future.
Physically, qubits are connected to each other using a so-called Chimera topology. The smallest Chimera unit contains a complete bipartite graph of eight vertices, each of which is connected to its four neighbours inside the unit and to its two neighbours outside the unit.

In [6], the authors compare the performance of a physical quantum annealer (D-Wave 2X) to simulated annealing and quantum Monte Carlo methods executed on a classical processor.

Furthermore, the authors in [22] extend the Google Inc. studies by comparing quantum annealing to state-of-the-art optimization methods, introducing more sophisticated assessment metrics. Their work considers four categories of optimization methods: sequential methods that include quantum annealing, simulated annealing and quantum Monte Carlo, tailored methods that solve simplified optimization problems, and non-tailored methods that are generic and thus represent the state of the art. The authors conclude that physical quantum annealing scales better than other sequential optimization methods, but falls behind tailored as well as non-tailored state-of-the-art methods. The authors also emphasize the importance of determining the application domain where quantum annealing maximizes its benefits, which has yet to be defined.

Finally, King et al. [19] introduce a problem class that can maximize the usefulness of the quantum tunneling effect. The authors again compare quantum annealers to classical solvers and demonstrate a speed-up of three to four orders of magnitude in favor of quantum annealing.

Several studies demonstrate the use of quantum annealing for task scheduling. In [32], the authors introduce a hybrid quantum-classical approach to solving scheduling problems. Their framework integrates quantum annealing with classical computing into a guided tree search. Classical algorithms manage a global tree search and communicate the node search in QUBO format to the quantum annealer. The authors test the proposed framework on three scheduling problems, i.e. graph coloring, Mars lander task scheduling, and airport runway scheduling. Results show that the quantum annealer’s output can effectively prune and guide the search process. The authors motivate their work by the necessity to expand on the capabilities of current quantum annealers and do not expect quantum annealers to be competitive in the near term against classical computers.

Fig. 1: Task Communication Graph (TCG) assignment on a heterogeneous multi-PU system: problem mapping on QUBO. Panels: a) QAP mapping on QUBO (4 tasks, 4 PUs); b) TCG mapping on QUBO (5 tasks, 2 PUs); c) TCG partitioning and mapping on QUBO (sub-graphs SG1 to SG3 mapped to sub-QUBO 1 to 3, with input edges).

In our work, we address a different scheduling problem, i.e. topology-aware assignment. The proposed TIGER framework extends existing software environments by automatically generating and dynamically adjusting QUBO files. We evaluate the tool flow in terms of quantum solver performance and the quality of task/gate assignment, and discuss the potential scalability of near-term machines.

2.1 Problem formulation and programming

Quantum annealers minimize the QUBO problem described by Equation 1. The equation describes the evolution of the time-dependent Hamiltonian [14] that aims to find low-energy states in a system of $N$ interacting spins, i.e. qubits. In Equation 1, $q_i$ represents qubits that take a value from the set $\{0, 1\}$, $h_i$ is a weight coefficient associated with each qubit, $J_{ij}$ denotes the strength of the coupling between two qubits $q_i$ and $q_j$, and $N$ is the number of qubits.
$$E(q_1, \ldots, q_N) = \sum_{i=1}^{N} h_i \cdot q_i + \sum_{i<j}^{N} J_{ij} \cdot q_i \cdot q_j \qquad (1)$$

Fig. 2: Binary solution interpretation: computation task assignment and communication impact. (Legend: source unit, destination unit, inter-unit communication, local communication.)

Q represents the permutation matrix X, where each qubit defines the assignment of a task to a specific PU similar to $x_{ij}$ above. An $x_{ij}$ value of 1 represents that task $i$ was assigned to PU $j$. A weight coefficient $h_i$ (not shown) represents the computational cost of the assignment. Since solvers in current machines find local minima, we transform positive computation costs into negative numbers to prevent the solver from giving all-zero answers. To respect assignment constraints such as assigning one task to one qubit, we use qubit couplings and give them high penalty values such that $J_{ij} \gg |h_i|$. For example, to prevent task 0 from being placed on multiple PUs, we couple each of the six pairs $(q_i \cdot q_j)$ formed by the four qubits that encode task 0. Therefore, if two of these qubits have the same task assigned to them, the large penalty value will make the overall solution ineligible.

3.2 Task-communication graph assignment

Applications can be represented as a weighted directed acyclic graph, usually referred to as a Task Communication Graph (TCG). A TCG is defined as a tuple $G = (V, E)$, where $V = (v_i)$ is a set of weighted vertices with the weight representing task computational cost, and $E = (e_{i,j})$ is a set of weighted edges with the weight representing inter-task communication cost. An example of a TCG is shown in the upper part of Figure 1(b).

Mapping such a TCG onto QUBO differs from the previously shown LAP in three aspects.
First, a TCG includes not only computation cost, but also inter-task communication cost expressed with graph edges. Second, not all tasks are assigned to PUs within the same time frame. A TCG is divided into multiple dependency levels, each of which represents a LAP. Dependency levels (groups) are shown with red dashed lines. Third, within each dependency level, the number of independent tasks can differ from the number of available PUs. The QUBO mapping transformation respects each of the above three constraints.

Communication edges. Each communication edge is included in the QUBO by qubit coupling. Communication cost is represented by coupling strength. Total end-to-end cost is calculated based on the weight of each edge in the communication path. If both source and destination tasks are assigned to the same PU, the communication cost is equal to zero. This is the most favourable case if the objective is to minimize data movement. For the example in Figure 1(b), to define the edge between task 0 and task 1 we couple the qubit pairs that place the two tasks on different PUs with the associated topology-aware communication cost, and the qubit pairs that place them on the same PU with zero communication cost. Here, cost values are converted to negative numbers similar to computation cost values. The relative priority of communication and computation costs can be formulated by adding a weight factor to bias the solver.

Dependency levels. Because of dependencies, only a certain number of tasks can be assigned to PUs in parallel. This relaxes the second assignment constraint, which says that no more than one task can be placed on a PU. This constraint is valid only for tasks belonging to the same dependency group. For the example shown in Figure 1(b), task 0 is separated from task 1 and task 2 with a red dashed line. Thus, we couple only the two qubit pairs $(q_i \cdot q_j)$ that correspond to task 1 and task 2 sharing a PU, and give them a high penalty cost to prevent placing these tasks on the same PU, which would otherwise be a valid solution for the solver.
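The mapping rules above can be illustrated with a small sketch. The instance below (three tasks, two PUs, the cost values, `OFFSET` and `PENALTY`) is hypothetical and not taken from our benchmarks, and the helper names are ours for illustration only. The sketch assumes that computation costs are made negative by subtracting a constant offset, and it keeps communication costs positive for readability, which is a simplification of the sign conversion described above. The tiny QUBO is minimized by exhaustive enumeration rather than by an annealer.

```python
from itertools import product

# Hypothetical toy instance (not from the paper): 3 tasks, 2 PUs.
N_PUS = 2
COMP_COST = {0: [2, 3], 1: [2, 2], 2: [3, 1]}  # COMP_COST[task][pu]
EDGES = [(0, 1, 4), (0, 2, 1)]                  # (source, destination, data volume)
LEVELS = [[0], [1, 2]]                          # dependency levels
OFFSET = 10    # assumption: shifts every computation cost below zero
PENALTY = 100  # assumption: constraint couplings with J >> |h|


def qubit(task, pu):
    """Index of the qubit that encodes 'task runs on pu' (task-major order)."""
    return task * N_PUS + pu


def build_qubo():
    """Return the QUBO as a dict {(i, j): weight} with i <= j."""
    q = {}

    def add(i, j, weight):
        key = (min(i, j), max(i, j))
        q[key] = q.get(key, 0) + weight

    for task, costs in COMP_COST.items():
        for pu, cost in enumerate(costs):
            add(qubit(task, pu), qubit(task, pu), cost - OFFSET)  # linear term h_i
        for a in range(N_PUS):                                    # one PU per task
            for b in range(a + 1, N_PUS):
                add(qubit(task, a), qubit(task, b), PENALTY)
    for level in LEVELS:                                          # same level: no shared PU
        for x in range(len(level)):
            for y in range(x + 1, len(level)):
                for pu in range(N_PUS):
                    add(qubit(level[x], pu), qubit(level[y], pu), PENALTY)
    for src, dst, volume in EDGES:                                # communication edges
        for a in range(N_PUS):
            for b in range(N_PUS):
                if a != b:                                        # same PU costs nothing
                    add(qubit(src, a), qubit(dst, b), volume * abs(a - b))
    return q


def energy(q, bits):
    """Evaluate the QUBO objective for one binary assignment."""
    return sum(w * bits[i] * bits[j] for (i, j), w in q.items())


qubo = build_qubo()
n_qubits = len(COMP_COST) * N_PUS
best = min(product((0, 1), repeat=n_qubits), key=lambda bits: energy(qubo, bits))
placement = {t: pu for t in COMP_COST for pu in range(N_PUS) if best[qubit(t, pu)]}
print(placement, energy(qubo, best))
```

In this toy instance the minimum keeps task 0 and task 1 on the same PU (their edge carries the most data) and places task 2, which shares a dependency level with task 1, on the other PU.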
The first assignment constraint, which says that a task cannot be placed on multiple PUs at the same time, remains unchanged.

Level adjustments. When the number of parallel tasks exceeds the number of available computing resources, a decision has to be taken to prioritize a set of tasks in the most efficient way. This decision is reflected in the qubit matrix, i.e. the order of columns associated with specific tasks and the corresponding assignment constraint couplings. Multiple approaches exist in the field, but their study is out of the scope of this paper. Here, we apply a simple cut based on the task ID increment. Figure 1(b) illustrates the case in which task 4 belongs to dependency level 1, but is moved to the next level. In case there are no available slots in the following group of tasks, an additional level is created.

3.3 Domain-specific TCG partition

Given the number of logical qubits together with the potential number of couplings and constraints per single problem, we quickly exhaust the physical capabilities of quantum machines. Therefore, an intelligent problem partition is required. There has been extensive research on graph partitioning [30]. In this context, we apply the method shown in Figure 1(c). This method divides a TCG into sub-graphs (SGs) based on dependency levels. The example shown in Figure 1(c) illustrates partitioning with two and three dependency levels per sub-QUBO1/2 and sub-QUBO3 respectively. The finest granularity corresponds to one dependency level per sub-QUBO. Further division of the problem will distort the concept of optimal parallel task assignment. The weakness of such a partitioning is that only communication edges inside an SG are regarded. Thus, multiple communication edges get excluded from the problem and are not represented in the qubit matrix. Excluded edges are labelled with red crosses in Figure 1(c).
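A minimal sketch of the level-based partition follows. It assumes that the dependency level of a task is the length of its longest chain of predecessors and that every sub-graph receives the same number of levels; both are assumptions made for illustration, and the function names and the example graph are hypothetical rather than part of TIGER.

```python
def dependency_levels(n_tasks, edges):
    """Level of a task = length of its longest chain of predecessors."""
    preds = {t: [] for t in range(n_tasks)}
    for src, dst in edges:
        preds[dst].append(src)
    level = {}

    def visit(task):
        if task not in level:
            # Source tasks have no predecessors and land on level 0.
            level[task] = 1 + max((visit(p) for p in preds[task]), default=-1)
        return level[task]

    for task in range(n_tasks):
        visit(task)
    return level


def partition(n_tasks, edges, levels_per_sg):
    """Group consecutive dependency levels into sub-graphs (SGs)."""
    level = dependency_levels(n_tasks, edges)
    sg_of = {t: level[t] // levels_per_sg for t in range(n_tasks)}
    n_sgs = max(sg_of.values()) + 1
    subgraphs = [[t for t in range(n_tasks) if sg_of[t] == s] for s in range(n_sgs)]
    internal = [e for e in edges if sg_of[e[0]] == sg_of[e[1]]]  # kept in a sub-QUBO
    excluded = [e for e in edges if sg_of[e[0]] != sg_of[e[1]]]  # cross an SG boundary
    return subgraphs, internal, excluded


# Hypothetical six-task graph, two dependency levels per sub-graph.
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5)]
subgraphs, internal, excluded = partition(6, EDGES, levels_per_sg=2)
print(subgraphs, internal, excluded)
```

Here three of the seven edges cross the boundary between the two sub-graphs and are therefore absent from both sub-QUBOs.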
Excluding these edges may have a significant impact on the quality of the provided solution, especially for communication-intensive applications.

Part of the novelty of our work is improving the partition by applying an interactive previous-placement-dependent approach. This approach takes advantage of dependency level-based partitioning. Sub-QUBOs are solved one after another and each previous SG placement is used to enhance the following sub-QUBOs. Our mapper extends the qubit matrix with additional virtual qubits, one for each unique source task of all excluded input edges (edges that are inputs to an SG). Each such qubit is associated with a specific PU because the previous task placement is already known at this point. In Figure 1(c), virtual qubits are shown as red crosses inside the sub-QUBO matrices, and the missed edges previously shown as crossed out are illustrated with red arrows.

Our approach guides the solver towards a better solution than is possible with heuristics alone, but does not guarantee an optimal solution because the output edges of the sub-graphs are still excluded from the problem and the future placement is not available at this point. It should also be emphasized that QUBO minimizes the sum of given costs, which are abstract positive numbers. Minimizing the sum does not guarantee that parallel execution time is also minimized, if that is determined by the slowest task.

3.4 Binary solution interpretation

Figure 2 illustrates the binary solution interpretation by mapping the example graph from Figure 1(c) on the four-unit mesh architecture. Each block corresponds to a dependency level of the task-communication graph. It contains three illustrative components, i.e. a qubit sub-matrix with solution values, the computation task placement corresponding to the solution, and the communication traffic based on the prior task placements. In case both source and destination tasks are placed on the same unit, the communication edge is marked as local communication.
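The interpretation step can be sketched as follows. The bit layout (task-major order), the row-major numbering of PUs on the mesh, and the use of hop count times data volume as the communication cost are assumptions for illustration; the function names and the example values are hypothetical.

```python
def decode(bits, n_tasks, n_pus):
    """Turn a flat, task-major bit list into a {task: pu} placement."""
    placement = {}
    for task in range(n_tasks):
        row = bits[task * n_pus:(task + 1) * n_pus]
        if sum(row) != 1:
            raise ValueError(f"task {task} is not assigned to exactly one PU")
        placement[task] = row.index(1)
    return placement


def hops(pu_a, pu_b, n_cols):
    """Manhattan distance between two PUs of a row-major mesh."""
    row_a, col_a = divmod(pu_a, n_cols)
    row_b, col_b = divmod(pu_b, n_cols)
    return abs(row_a - row_b) + abs(col_a - col_b)


def communication(edges, placement, n_cols):
    """Return the local edges and the total data movement cost."""
    local, cost = [], 0
    for src, dst, volume in edges:
        distance = hops(placement[src], placement[dst], n_cols)
        if distance == 0:
            local.append((src, dst))  # both tasks share a unit
        cost += volume * distance
    return local, cost


# Hypothetical solution for 3 tasks on a 2x2 mesh (4 PUs).
BITS = [1, 0, 0, 0,
        0, 0, 0, 1,
        1, 0, 0, 0]
EDGES = [(0, 1, 5), (0, 2, 3), (1, 2, 2)]  # (source, destination, data volume)
placement = decode(BITS, n_tasks=3, n_pus=4)
local_edges, data_movement = communication(EDGES, placement, n_cols=2)
print(placement, local_edges, data_movement)
```

Task 0 and task 2 share PU 0 in this example, so their edge is local, while the two edges that touch task 1 each travel two hops across the mesh.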
Local communications do not contribute to the data movement component of the objective function and represent the most favourable assignment for communication cost minimization.

3.5 Computation and Communication costs

Computation and communication costs have been previously discussed as abstract positive numbers. However, the nature of the cost metric determines whether the proposed method provides an optimal solution. If the cost is based on delay and the goal of task assignment is to minimize time, QUBO minimization will not provide the optimal placement. This is because QUBO minimizes the sum of the placement costs in each SG, and it does not guarantee that the execution time of tasks placed in parallel is minimal. For other metrics, such as data movement, power consumption, and energy, the proposed method provides an optimal solution.

quantum circuits. Figure 3(a) shows an example of a quantum circuit. To avoid confusion, the qubits represented in the circuit will be referred to as logical qubits and the real qubits inside a quantum computer as physical qubits. Four horizontal lines represent logical qubit state evolution over time (from left to right). Single- and two-qubit gates are applied on specific qubits according to algorithm computations. Quantum circuits can be transformed into a task-communication graph similar to the classical algorithm transformation. In this case, quantum gates represent tasks that have dependencies (black arrows). Figure 3(b) shows the Quantum Circuit Graph (QCG) in the form of the TCG.

Fig. 3: Quantum circuit graph: gate-to-qubit assignment. Panels: a) Quantum Circuit (single-qubit and two-qubit gates on logical qubits [q0] to [q3]); b) Quantum Circuit as TCG (single-qubit gate task x, two-qubit gate tasks x.1 and x.2); c) Quantum Chip Topologies (IBM Vigo, 5 qubits; IBM QX2, 5 qubits).
A two-qubit gate becomes two connected tasks in the QCG. Moreover, two-qubit gates are directional, i.e. there are source and destination qubits in the pair.

Topology-aware quantum gate assignment is based on physical qubit connectivity inside the quantum chip. Figure 3(c) shows an example of 5-qubit chip connectivity. Arrows show not only the connection between two physical qubits, but also the supported direction for the two-qubit gates. Because of the limited connectivity between qubits, not all two-qubit gates can be directly applied. For example, consider a circuit where a two-qubit gate is applied to logical qubits 0 and 3, and the circuit is matched to the architecture in Figure 3(c). There are two ways to map the qubits to the circuit. The first is to map the logical qubits to physical qubits in a different order, such that logical qubits 0 and 3 are mapped to physical qubits 0 and 2. The other is to swap the underlying logical qubit states, in case they are already mapped to the architecture in the same order. For instance, if the states of qubits 2 and 3 are swapped, physical qubit 2 would now contain the state of logical qubit 3, making it possible to apply the desired two-qubit gate.

4.2 Fidelity and SWAP operation costs

Unlike a classical assignment optimization problem that minimizes computation and communication costs (described in Section 3.5), quantum gate assignment optimization targets different metrics. One of the most important parameters for quantum computations in the NISQ era is fidelity. Circuit fidelity is a measure of how much quantum information is preserved [23]. Due to noise, the experimentally obtained output qubit state is different from the desired output qubit state that would have been obtained in the ideal scenario. There is a direct correlation between the number of gates and circuit fidelity.

Typically, in the case of superconducting technology, single-qubit gates have higher fidelity than two-qubit gates, which require significantly more effort to tune and improve.
Each physical qubit is unique in its properties and has a different fidelity per gate. The fidelity resulting from mapping logical qubits and their corresponding gates to the underlying architecture’s physical qubits will be referred to as $fidelity_{mapping}$.

There are several types of two-qubit gates. A SWAP gate swaps the states of two qubits. A SWAP gate is usually decomposed into a sequence of three CNOT two-qubit gates. CNOT belongs to the so-called native set of gates that is supported by the control hardware and quantum chip technology. The need for this operation is dictated by the nature of quantum computation: it is not possible to make a copy of a qubit state (no-cloning theorem [28] [36]). A SWAP gate is used to move the qubit state to the right location. Thus, the number of SWAP operations $N_{swaps}$ is similar to the data movement (communication) cost of the classical TCG. Consequently, quantum state movement is required to satisfy chip connectivity. This movement comes at a cost, because two-qubit gates are the main source of infidelity in quantum circuits. The reduction in fidelity resulting from the insertion of SWAP gates, each having fidelity $fidelity_{swap}$, will be referred to as $fidelity_{movement}$.

$$fidelity_{movement} = (fidelity_{swap})^{N_{swaps}}$$
$$fidelity_{total} = fidelity_{mapping} \cdot fidelity_{movement} \qquad (2)$$

Since two-qubit gates have lower fidelity, quantum gate assignment optimization can be formulated as $N_{swaps}$ minimization. However, in order to obtain the best total fidelity for the quantum circuit, both optimization parameters need to be taken into account, i.e. gate mapping fidelity and the minimum number of SWAPs. That makes the optimization problem almost identical to the classical topology-aware task assignment on extremely heterogeneous architectures, where $fidelity_{mapping}$ represents the computation performance to be maximized and $N_{swaps}$ represents the communication cost to be minimized. Equation 2 shows how the optimization of these two metrics can be reformulated as the maximization of the total fidelity $fidelity_{total}$. A large number of recent studies target total circuit fidelity maximization [8]. However, they solve the optimization problem of circuit gate decomposition and assignment to minimize the number of gates, especially SWAP gates, without consideration of $fidelity_{mapping}$.

Fig. 4: Topology-aware task assignment using TIGER and quantum annealing. (Components: problem input TCG and ARC; TIGER with QUBO mapper, QMI interface and Mapping-to-Metric; sub-*.qubo and .qmasm files; qbsolv decomposer; solver: D-Wave or TABU search; mapping solutions; branches for size < limit value and size > limit value; external modelling tools.)

4.3 Weight Optimization Algorithm

Ising machine weights allow us to vary the priority of one or another optimization metric. By scaling the weights associated with SWAP minimization, either the qubit fidelity or the SWAP reduction can be prioritized. To scale the weights, a priority coefficient $pref$ is introduced. To arrive at the optimal solutions either in terms of the resulting number of inserted SWAP gates or gate fidelity, we propose an optimization algorithm. It searches for the coefficient value that maximizes $fidelity_{total}$. Since $fidelity_{total}$ is obtained from $fidelity_{mapping}$ and $fidelity_{movement}$, the algorithm can also find a solution with maximum $fidelity_{mapping}$ or minimum qubit movement. Due to the infidelity of SWAP gates, a solution with minimum $N_{swaps}$ should correspond to the maximum $fidelity_{total}$ solution. However, in a hypothetical fully-connected architecture where the qubit movement constraint is eliminated, $fidelity_{mapping}$ would correspond to $fidelity_{total}$.
In such a scenario it would be practical to maximize only mapping fidelity. Optimizing only the $fidelity_{mapping}$ or the $N_{swaps}$ metric can also give an estimate of the bounds of these metrics in case no optimal solution is known beforehand. Moreover, the proposed optimization algorithm is suitable when a specific computation-to-communication ratio needs to be maintained in task assignment, for example. The pseudocode is given in Algorithm 1. The search starts with an initial preference coefficient, gets the corresponding metric value, for example $fidelity_{total}$, and compares it to other solutions with a larger and a smaller coefficient. The search space range is defined by setting the parameter sSpr. How fast the algorithm converges is defined by the parameter sRed, which reduces the search space at every step. For better local search space exploitation, lines 6-17 can be repeated with sSpr = √sSpr.

Algorithm 1: Preference coefficient optimization
Data: QCG, ARC
Result: fidelity_best, pref_best
sSpr = 2 // search spread, sets the search space range
sRed = 0. // spread reduction, reduces sSpr at every step for convergence
pref_best = 0. // initial preference coefficient
fidelity_best = tiger(QCG, ARC, pref)
while sSpr > do
    pref_left = pref / sSpr
    pref_right = pref * sSpr
    fidelity_left = tiger(QCG, ARC, pref_left)
    fidelity_right = tiger(QCG, ARC, pref_right)
    if fidelity_left > fidelity_best then
        fidelity_best = fidelity_left
        pref_best = pref_left
    end
    if fidelity_right > fidelity_best then
        fidelity_best = fidelity_right
        pref_best = pref_right
    end
    sSpr = sSpr * sRed
end

TIGER is an open-source QUBO mapper written in Python. It uses the NetworkX Python package [7] to create and manipulate TCG/QCG and ARC structures, i.e. computing the computation and communication costs for classical problems and the fidelity and SWAP costs for quantum problems, taking into account the hardware (architecture) topology.
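Since TIGER is written in Python, the search of Algorithm 1 can be sketched in the same language. This is an illustration under stated assumptions, not the TIGER implementation: the default values `pref_init=0.5`, `reduction=0.9` and the stopping threshold `stop=1.0` are chosen for the example, each iteration is centred on the best coefficient found so far (one possible reading of Algorithm 1), and `toy_tiger` is a hypothetical stand-in for a full mapping run that returns a fidelity value.

```python
import math


def optimize_preference(evaluate, pref_init=0.5, spread=2.0, reduction=0.9, stop=1.0):
    """Shrinking-spread search for the preference coefficient.

    evaluate(pref) must return the metric to maximize, e.g. total fidelity.
    All default values are assumptions made for this example.
    """
    pref_best = pref_init
    fidelity_best = evaluate(pref_best)
    while spread > stop:
        center = pref_best  # assumption: search around the current best
        for candidate in (center / spread, center * spread):
            fidelity = evaluate(candidate)
            if fidelity > fidelity_best:
                fidelity_best, pref_best = fidelity, candidate
        spread *= reduction  # narrow the search range at every step
    return fidelity_best, pref_best


def toy_tiger(pref):
    """Hypothetical stand-in for a TIGER run: fidelity peaks at pref = 1.5."""
    return math.exp(-(math.log(pref) - math.log(1.5)) ** 2)


fidelity_best, pref_best = optimize_preference(toy_tiger)
print(fidelity_best, pref_best)
```

Because the candidates are obtained by dividing and multiplying by the spread, the search is symmetric on a logarithmic scale, which suits a coefficient that scales weights multiplicatively. Like Algorithm 1, it is a local search and may settle near, rather than exactly at, the best coefficient.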
We demonstrate TIGER on the D-Wave machine.

TIGER receives two files as inputs (marked with a red '1' to denote step 1), namely a TCG or QCG file and an ARC (architecture) file. The TCG file describes the classical application's task-communication graph, the QCG file describes the quantum algorithm's circuit graph, and the ARC file describes the architecture (hardware topology). The format of these files is presented in Figure 5 (a) and (b). The TCG file consists of two types of lines, associated with application tasks and edges. Task lines contain a task ID and multiple cost values, each of a different type, e.g. the number of integer, floating point, and memory access instructions. Edge lines contain an edge ID, the source and destination task IDs, and a cost value, e.g. the amount of data in bytes to be transferred between the two tasks. The architecture file describes the architecture topology and its details, such as the number of rows and columns, the number of PUs, and the capabilities of each PU and link, such as cost per type of instruction, link throughput, etc.

(a) Application TCG file format:
    taskID [cost1] [cost2] [cost3]
    ------------------------------
    edgeID task_1 task_2 [cost]

(b) Architecture ARC file format:
    ParameterIDs   Value
    ---------------------
    Topology       MESH
    NumRows        2
    NumCols        2
    NumPUs         4
    PU.0..3        1,2,2,4
    Link0..3       2,2,2,2
    …

Fig. 5: Topology-aware task assignment problem input.

Using the algorithm described in Section 3, TIGER maps the input TCG and ARC files into the QUBO format and generates the QMI interface file (step '2'). It supports both the qmasm and qubo formats and can generate a single file per problem, or multiple files if the QUBO partitioning option is chosen. If the size of the problem is less than the physical limit value, i.e. the qubit sub-matrix size, the QUBO or sub-QUBO can be solved directly (step '3'). Otherwise, it has to be further decomposed by qbsolv and then solved (step '4'). In both cases the problem is solved by one of two available solvers: the D-Wave annealer or the TABU-search qbsolv implementation (step '5').

Finally, the solver generates mapping solutions that are sent back to the TIGER tool. If the solution corresponds to a sub-QUBO (step '7'), TIGER uses it to generate the next sub-QUBO as described in Section 3.2. If the solution is complete (step '6') or the last sub-QUBO problem has been solved, TIGER calculates the final cost of the assignment through its Mapping-to-Metric (MtoM) interface (step '8'). This cost can be used to estimate the quality of the solution.

Fig. 6: Delay-to-Solution evaluation: (I) classical TABU-search solver w/o TIGER sQ partition, (II) quantum DW solver w/o TIGER sQ partition, and (III) quantum DW solver with TIGER sQ partition.

the COSMIC benchmark suite [33]. Table 1 shows the set of chosen benchmarks and their characteristics.

Table 1: Benchmark suite

Workload      Problem size   Tasks
Ultrasound    9x5x10         15      15
RS-encoder    32x28x8        141     140
RS-decoder    32x28x8        526     789

For quantum QCG assignment optimization, we create the QCG files formatted for TIGER from the quantum benchmark suite [39]. We create ARC files based on two IBM quantum chips [15]: IBM Yorktown (QX2) with 5 qubits and IBM Vigo with 5 qubits. Figure 3 (c) illustrates these two topologies.
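As an illustration of how an ARC topology constrains qubit movement, the sketch below encodes the two 5-qubit devices as undirected coupling graphs and computes how many SWAPs are needed to bring two physical qubits next to each other. The edge lists are an assumption based on the publicly documented coupling maps of these devices and are not taken from the ARC files. The helper names are hypothetical, and plain Python is used instead of NetworkX only to keep the sketch free of dependencies.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

# Assumed undirected coupling maps of the two 5-qubit devices.
IBM_VIGO = [(0, 1), (1, 2), (1, 3), (3, 4)]                 # T-shaped
IBM_QX2 = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]  # bow-tie


def adjacency(edges: Iterable[Tuple[int, int]], num_qubits: int = 5) -> Dict[int, Set[int]]:
    """Build an adjacency map from an undirected edge list."""
    adj: Dict[int, Set[int]] = {q: set() for q in range(num_qubits)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def hop_distance(adj: Dict[int, Set[int]], source: int, target: int) -> int:
    """Shortest path length between two physical qubits (breadth-first search)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("qubits are not connected")


def swaps_to_adjacent(adj: Dict[int, Set[int]], q1: int, q2: int) -> int:
    """Minimum SWAPs that make q1 and q2 neighbours, ignoring all other gates."""
    return max(hop_distance(adj, q1, q2) - 1, 0)


vigo, qx2 = adjacency(IBM_VIGO), adjacency(IBM_QX2)
print("vigo, qubits 0 and 4:", swaps_to_adjacent(vigo, 0, 4), "SWAPs")
print("qx2,  qubits 0 and 4:", swaps_to_adjacent(qx2, 0, 4), "SWAPs")
```

Under these assumed coupling maps, the denser bow-tie graph has shorter worst-case distances than the T-shaped graph, so the same logical qubit pair generally needs fewer SWAPs on it. The count is a per-gate lower bound and does not account for the way earlier SWAPs change the placement of the remaining qubits.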
The quantum benchmark suite [39] provides 48 circuits for 5-qubit chips. We reduce the circuit size down to 50 gates.

6.2 Tool flow evaluation

For each workload we evaluate three scenarios: (I) TIGER QUBO mapper - qbsolv decomposer/TABU-search qbsolv solver - TIGER MtoM interpreter, (II) TIGER QUBO mapper - qbsolv decomposer/DW solver - TIGER MtoM interpreter, and (III) TIGER QUBO mapper/TIGER SG partitioner - qbsolv decomposer/TABU-search qbsolv solver - TIGER MtoM interpreter. For each scenario, we vary the size of the architecture to a 2 × × × I . In addition, we show the number of logical qubits and couplers generated by TIGER's mapper (qubits and couplers), the number of partitions provided by qbsolv's decomposer (partitions), and the number of SGs generated by TIGER's partitioner (tiger sQs).

The number of qubits in scenarios I and II is equal, but it is higher in scenario III because additional qubits are required to define the previous sub-QUBO placements, as shown in Figure 1. Similarly, the number of couplers and the number of partitions are equal in scenarios I and II. They are lower in scenario III due to the optimized QUBO mapping. The number of TIGER sub-QUBOs is reported only for scenario III. In scenarios I and II this TIGER option is not applied (na).

Discussion: The performance evaluation results show that the physical quantum annealer, i.e. DW2X, can significantly reduce the delay-to-solution compared to the classical qbsolv solver. For the given set of benchmarks and architecture configurations, the speedup of the DECOMPOSER-SOLVER phase varies between 1.2× and 10.2×. The major portion of this improvement is caused by the replacement of the classical solver with the quantum annealer. The average DW2X access time is around 20 ms. This time includes programming time, sampling time, and post-processing time.
The sampling phase consists of multiple sample batches, each of which includes annealing, readout, and an additional delay that allows the quantum annealer to cool down to its initial state. The annealing time is 20 µs. Although the QUBO is solved by a physical quantum annealer, a significant amount of time is spent on problem decomposition by the qbsolv DECOMPOSER. The total D-Wave SOLVER phase is composed of multiple D-Wave accesses, where the number of accesses is determined by the number of partition calls issued by the qbsolv DECOMPOSER. Therefore, when the quantum annealing solver is used, the delay-to-solution depends strongly on the quality of the classical decomposition.

Fig. 7: Task assignment sensitivity and quality of the solution for (a) Ultrasound-9x5x10, (b) Reed-Solomon Encoder-32x28x8, and (c) Reed-Solomon Decoder-32x28x8. (DW, single): DW w/o sQ vs. classical TABU-search w/o sQ; (qbsolv, sQ): classical TABU-search with sQ vs. classical TABU-search w/o sQ; (DW, sQ): DW with sQ vs. classical TABU-search w/o sQ.

In scenario III, we evaluate the impact of the domain-specific partitioning integrated into the QUBO mapper, i.e. the TIGER-level partitioner. Here, the reported values represent the sum over all sub-QUBOs of the number of qubits and couplers as well as the delays per phase. The results show that by applying two-level QUBO partitioning (i.e. domain-specific first and classical qbsolv second), a massive speedup in time-to-solution can be achieved. For the given set of TCGs and ARCs, the DECOMPOSER-SOLVER phase is reduced to 6% compared to the baseline scenario. This improvement has several sources. First, the TIGER partition significantly simplifies the task for the qbsolv DECOMPOSER, which performs better on a smaller subset of qubits and couplers than on a single large problem. Consequently, qbsolv generates fewer partition calls, thereby reducing the delay of the D-Wave SOLVER phase.
This effect is particularly noticeable for larger TCGs, where the number of partitions is reduced by a factor of two. The total number of qubits and couplers also differs from the baseline. By applying the minimum possible number of qubits and adjusting the level of granularity (i.e. one sub-level per sub-QUBO), we reduce the total number of couplers. These improvements are achieved at the expense of a larger number of qubits. This increase is 12% on average compared to the baseline. On the other hand, additional partitioning can potentially impact the quality of the generated solution. This effect is evaluated in the following section.

6.3 Task assignment evaluation

We evaluate the assignment quality and multiple-run sensitivity in three comparison scenarios: (i) single QUBO on the quantum annealer versus the classical qbsolv solver (DW, single), (ii) partitioned sub-QUBOs versus single QUBO assignment on the classical qbsolv solver (qbsolv, sQ), and (iii) partitioned sub-QUBOs on the quantum annealer versus single QUBO assignment on the classical qbsolv solver (DW, sQ). Architecture configuration files represent a 2 × 2, 4 × 4, or 8 × × to 4 × . Link cost is equal to 2. Figure 7 shows the difference in computation, communication, and total costs for the three evaluation scenarios compared to the baseline.

Discussion: In some cases, we obtain the same solution over multiple runs. If different solutions are returned, the variation is usually within 5% of the mean value. For the given set of experiments, the DW2X quantum solver provides solution improvements for a single QUBO compared to the classical TABU-search solver. The results show up to 8% improvement in computation cost, up to 25% improvement in communication cost, and up to 15% improvement in total cost. The qbsolv sQ and DW sQ scenarios show similar behaviour in most experiments. However, the DW2X quantum solver again provides better solutions, e.g. RS-Encoder mapped on 2 × × × for US TCG mapped on 2 × 2, 4 × × RS Encoder TCG as shown in Figure 7(b). However, the computation constituent does not deteriorate. In both TCGs, task computation costs far outweigh communication edge costs. For instance, the US computation cost ranges between 4,510 and 3,461,112, while the highest communication cost is 20, 60 and 140 for 2 × 2, 4 × 4, and 8 × RS Decoder TCG is communication intensive. The computation cost varies up to 1,880, while the communication cost reaches 14,280 for 8 × swaps and fidelity_mapping a set of experiments was performed on the QCGs mentioned in Section 6.1.2. The preference coefficient varies from 0.01 to 30. Figure 8 shows the mapping fidelity (fidelity_mapping), the number of swaps (N_swaps), and the total fidelity for different coefficient values. Smaller coefficients minimize qubit state movement, while larger ones prioritize mapping fidelity instead. The black box shows a near-optimum region of the priority coefficient. Using a priority coefficient smaller than 0.05 causes the algorithm to produce invalid solutions and can even have the opposite effect, increasing N_swaps instead. Setting the coefficient larger than 20 provides only a small improvement in fidelity_mapping, and only on some architectures, while incurring a disproportionate number of additional SWAPs. Hence, the applicable coefficient values that produce the minimum N_swaps and the maximum fidelity_mapping are approximately 0.05 and 20, respectively. Total fidelity strongly correlates with the number of SWAPs, and mapping fidelity plays a negligible role in this scenario.

Fig. 8: IBM Vigo: mapping fidelity, number of swaps and total fidelity.

Discussion: Since fidelity_movement, which results from N_swaps, has a larger impact on fidelity_total, N_swaps is usually minimized and gate fidelity is not considered. This means that the priority coefficient that maximizes fidelity_total is the same one that minimizes N_swaps, i.e. 0.05.
However, as connectivity in quantum computing architectures increases, qubit movement might become less significant. In such a context, the maximization of fidelity_total would depend entirely on fidelity_mapping.

To tackle any possible scenario, fidelity_total can be maximized regardless of connectivity and gate fidelity. The priority coefficient that allows such a maximization is unknown and can vary for every circuit and architecture. We study the proposed weight optimization algorithm (WOA) to assess its efficiency in finding the optimal priority coefficient for a given combination of quantum circuit and device topology.

Figure 9 shows the optimization of total fidelity and number of SWAPs using the WOA for multiple circuits on the IBM Vigo and IBM QX2 topologies. The results include the initial value at the beginning of the algorithm execution and the final value. For the IBM Vigo topology (Figure 9 (a) and (b)), the WOA finds a priority coefficient that reduces the number of SWAPs from the initial step value in 62.5% of cases. In 37.5% of cases the number of SWAPs remains unchanged. The results with a strong reduction are highlighted in green. On average, the WOA improves total fidelity by 39% for the IBM Vigo topology. For the IBM QX2 topology (Figure 9 (c) and (d)), the WOA finds a priority coefficient that reduces the number of SWAPs from the initial step value in 83.3% of cases. In one case the number of SWAPs remains unchanged, and in 14.6% of cases the WOA yields a slight increase in the number of SWAPs. On average, the WOA improves total fidelity by 107% for the IBM QX2 topology.

Fig. 9: Quantum gate assignment: weight optimization algorithm search. (a) Vigo: Fidelity, (b) Vigo: Number of SWAPs, (c) QX2: Fidelity, (d) QX2: Number of SWAPs.

Discussion: The results show a significant difference in WOA performance when it is applied to different topologies.
While in general the WOA allowed us to find a more suitable combination of QUBO weights (preference coefficient) for both topologies, the IBM QX2 mapping is much more sensitive to the choice of priority coefficient. Moreover, in a few cases the WOA missed the optimal solution, which resulted in a slight increase in the number of SWAPs compared to the initial value. We believe that the reason lies in the complexity of the topology graph, which requires adjusting the QUBO weights to find the most suitable combination in a near-optimum region.

Finally, we compare the performance of the TIGER topology-aware SWAP optimizer against the IBM QX optimizer. Figure 10 shows the comparison results across multiple circuits for two topologies, i.e. vigo and qx2. The numbers show the final number of SWAPs. The SWAP reduction color map highlights the cases in which one of the optimizers provides a better result, with the SWAP number differences grouped as follows: (i) 1-2 SWAPs, (ii) 3-4 SWAPs, (iii) 5-7 SWAPs, or (iv) more than 7. For the vigo topology, TIGER and IBM QX provide the same SWAP number in 18.7% of cases; IBM QX outperforms TIGER in 41.7% of cases with a total reduction difference of 51 SWAPs; and TIGER outperforms IBM QX in 39.6% of cases with a total reduction difference of 59 SWAPs. For the qx2 topology, TIGER and IBM QX provide the same SWAP number in only 4.2% of cases; IBM QX outperforms TIGER in 8.3% of cases with a total reduction difference of 12 SWAPs; and TIGER significantly outperforms IBM QX in 87.5% of cases with a total reduction difference of 260 SWAPs. Moreover, TIGER found the perfect mapping, reducing the data movement to 0 SWAPs, in 16.7% of cases, while IBM QX found the perfect mapping in only 4.2% of cases.

Discussion: Similar to the WOA evaluation results (see Section 6.4.1), the comparison results differ significantly between topologies. TIGER allowed us to significantly improve the mapping for the IBM QX2 topology compared to the IBM QX optimizer.
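The vigo statistics quoted above can be cross-checked against the per-circuit SWAP counts reported in Figure 10. The short script below is only such a check: the list and function names are hypothetical, and the values are copied from the two vigo rows of the figure.

```python
from typing import Dict, List

# Final SWAP counts per circuit for the vigo topology, copied from Figure 10.
TIGER_VIGO = [16, 18, 8, 6, 4, 14, 14, 17, 18, 18, 20, 13, 18, 11, 5, 4,
              15, 10, 16, 16, 18, 17, 11, 13, 12, 18, 14, 13, 19, 11, 16, 11,
              15, 15, 17, 19, 14, 15, 17, 7, 17, 10, 18, 18, 18, 18, 19, 16]
IBM_QX_VIGO = [16, 16, 12, 9, 6, 21, 21, 17, 15, 15, 19, 21, 16, 12, 5, 8,
               15, 12, 15, 17, 21, 18, 12, 12, 12, 15, 17, 9, 16, 12, 16, 12,
               14, 14, 14, 13, 18, 20, 13, 7, 17, 10, 16, 13, 16, 15, 18, 17]


def compare(tiger: List[int], ibm: List[int]) -> Dict[str, float]:
    """Count ties and wins, and sum the SWAP differences for each winner."""
    if len(tiger) != len(ibm):
        raise ValueError("both optimizers must report the same circuits")
    tiger_gain = [i - t for t, i in zip(tiger, ibm) if t < i]  # TIGER needs fewer
    ibm_gain = [t - i for t, i in zip(tiger, ibm) if i < t]    # IBM QX needs fewer
    n = len(tiger)
    ties = n - len(tiger_gain) - len(ibm_gain)
    return {
        "circuits": n,
        "ties": ties,
        "tiger_wins": len(tiger_gain),
        "ibm_wins": len(ibm_gain),
        "tiger_swaps_saved": sum(tiger_gain),
        "ibm_swaps_saved": sum(ibm_gain),
        "ties_pct": 100.0 * ties / n,
        "tiger_wins_pct": 100.0 * len(tiger_gain) / n,
        "ibm_wins_pct": 100.0 * len(ibm_gain) / n,
    }


for key, value in compare(TIGER_VIGO, IBM_QX_VIGO).items():
    print(f"{key}: {value}")
```

For the vigo rows this gives 9 ties, 19 circuits where TIGER needs fewer SWAPs (59 in total), and 20 circuits where IBM QX needs fewer (51 in total) out of 48 circuits, in agreement with the percentages and totals reported in the text.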
We believe that the reason also lies in the complexity of the topology graph. The classical IBM QX optimizer is not suitable for more complex topologies with a larger number of potential combinations, while the TIGER optimizer allows us to find the 'perfect' mapping regardless.

Fig. 10: Optimizer comparison: TIGER vs. IBM QX. Final number of SWAPs, one value per benchmark circuit. The SWAP Reduction Difference color map distinguishes differences of 1-2, 3-4, 5-7, and >7 SWAPs.

Topology  Optimizer        SWAPs per circuit
vigo      TIGER            16 18 8 6 4 14 14 17 18 18 20 13 18 11 5 4 15 10 16 16 18 17 11 13 12 18 14 13 19 11 16 11 15 15 17 19 14 15 17 7 17 10 18 18 18 18 19 16
vigo      IBM QX           16 16 12 9 6 21 21 17 15 15 19 21 16 12 5 8 15 12 15 17 21 18 12 12 12 15 17 9 16 12 16 12 14 14 14 13 18 20 13 7 17 10 16 13 16 15 18 17
qx2       TIGER IBM QX     10 6 4 1 0 0 0 11 11 6 9 11 10 6 5 5 13 6 10 11 6 12 5 5 5 10 7 2 8 5 10 5 5 10 10 13 6 11 10 8 8 9 6 10 10 11 16 6

In this paper, we propose an algorithm for solving the topology-aware task/gate assignment problem on physical Ising machines in order to accelerate the solution of this challenging NP-complete problem and improve its quality. We implement our solution in the TIGER tool, which transforms weighted task-communication, quantum circuit, and architecture graphs into an appropriate Hamiltonian function format. Our solution takes into account both computation and communication costs for the classical problem, or fidelity and SWAP number for the quantum problem.
We evaluate the proposed approach using D-Wave's quantum annealer. To overcome the physical limitations of current quantum annealers, we propose domain-specific partitioning based on the dependency levels of the task-communication graph. We also propose a weight optimization algorithm that adjusts the model parameters to find better solutions. We integrate TIGER into the D-Wave software stack, which enables us to apply both our proposed dependency-level partitioning and the partitioning provided by the qbsolv tool in a dynamic, iterative way. We demonstrate that our method can reach 15% higher-quality solutions 9% faster compared to the classical qbsolv heuristic algorithm. Finally, TIGER reduces the data movement cost by 68% on average for quantum circuit assignment compared to the IBM QX optimizer [15]. Our work alleviates the concern that task mapping may hinder high-quality solutions on future quantum accelerators with more physical qubits and complex connectivity. The TIGER tool is publicly available online at https://github.com/lbnlcomputerarch/tiger.

For future work, we consider three major directions:
– Comparison to a wide range of classical scheduling tools: we plan to design a methodology to compare the hardware optimizer, i.e. the Ising machine, to existing heuristic software tools.
– Use of other Ising machines: we plan to expand our study by running the problem on other Ising machines, such as the digital annealer [9] and the coherent Ising machine [37].
– Problem partitioning algorithms and additional constraint mapping: we plan to evaluate additional graph partitioning algorithms and alternative problem mapping algorithms, e.g. assigning multiple tasks to one node based on the node capacity.

Acknowledgements The research leading to these results has received funding from the U.S. Department of Energy, grant agreement no. DE-AC02-05CH11231.

References

1. Bokhari, S.H.: On the mapping problem. IEEE Transactions on Computers C-30(3), 207–214 (1981).
DOI 10.1109/TC.1981.1675756
2. Booth, M., Reinhardt, S.P., Roy, A.: Partitioning optimization problems for hybrid classical/quantum execution. Tech. rep. (2017)
3. Burkard, R., Dell'Amico, M., Martello, S.: Assignment Problems. Society for Industrial and Applied Mathematics, PA, USA (2009)
4. Chan, C.P., Bachan, J.D., Kenny, J.P., Wilke, J.J., Beckner, V.E., Almgren, A.S., Bell, J.B.: Topology-aware performance optimization and modeling of adaptive mesh refinement codes for exascale. In: 2016 First International Workshop on Communication Optimizations in HPC (COMHPC), pp. 17–28 (2016)
5. Daskalakis, C., Dikkala, N., Kamath, G.: Testing Ising models. IEEE Transactions on Information Theory pp. 1–1 (2019). DOI 10.1109/TIT.2019.2932255
6. Denchev, V.S., Boixo, S., Isakov, S.V., Ding, N., Babbush, R., Smelyanskiy, V., Martinis, J., Neven, H.: What is the computational value of finite-range tunneling? Phys. Rev. X , 031015 (2016). URL https://link.aps.org/doi/10.1103/PhysRevX.6.031015
(4), 045003 (2018). DOI 10.1088/2058-9565/aacf0b
11. Glover, F., Laguna, M.: Tabu Search. Kluwer Academic Publishers, Norwell, MA, USA (1997)
12. Glover, F.W., Kochenberger, G.A.: A tutorial on formulating QUBO models. ArXiv abs/1811.11538 (2018)
13. Hoefler, T., Snir, M.: Generic topology mapping strategies for large-scale parallel architectures. In: Proceedings of the International Conference on Supercomputing, ICS '11, pp. 75–84. ACM, New York, NY, USA (2011). URL http://doi.acm.org/10.1145/1995896.1995909
14. Hwang, F.K.: The Hamiltonian property of linear functions. Oper. Res. Lett. (3), 125–127 (1987). DOI 10.1016/0167-6377(87)90024-1. URL http://dx.doi.org/10.1016/0167-6377(87)90024-1
C-36 (4), 433–442 (1987)
21. Li, G., Ding, Y., Xie, Y.: Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices. arXiv e-prints arXiv:1809.02573 (2018)
22. Mandrà, S., Zhu, Z., Wang, W., Perdomo-Ortiz, A., Katzgraber, H.G.: Strengths and weaknesses of weak-strong cluster problems: A detailed overview of state-of-the-art classical heuristics versus quantum approaches. ArXiv e-prints (2), 022337 (2016)
23. Markov, I.L., Fatima, A., Isakov, S.V., Boixo, S.: Quantum Supremacy Is Both Closer and Farther than It Appears. arXiv e-prints arXiv:1807.10749 (2018)
24. Nielsen, M.A., Chuang, I.L.: Quantum Computation and Quantum Information: 10th Anniversary Edition, 10th edn. Cambridge University Press, New York, NY, USA (2011)
25. Orduna, J.M., Silla, F., Duato, J.: A new task mapping technique for communication-aware scheduling strategies. In: Proceedings International Conference on Parallel Processing Workshops, pp. 349–354 (2001)
26. Pakin, S.: A quantum macro assembler. In: 2016 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–8 (2016)
27. Pakin, S., Reinhardt, S.P.: A survey of programming tools for D-Wave quantum-annealing processors. In: R. Yokota, M. Weiland, D. Keyes, C. Trinitis (eds.) High Performance Computing, pp. 103–122. Springer International Publishing, Cham (2018)
28. Park, J.L.: The concept of transition in quantum mechanics. Foundations of Physics (1), 23–33 (1970). DOI 10.1007/BF00708652. URL https://doi.org/10.1007/BF00708652
29. Salimi, R., Motameni, H., Omranpour, H.: Task scheduling with load balancing for computational grid using NSGA II with fuzzy mutation. In: 2012 2nd IEEE International Conference on Parallel, Distributed and Grid Computing (2012)
30. Schaeffer, S.E.: Survey: Graph clustering. Comput. Sci. Rev. (2007)
31. Taura, K., Chien, A.: A heuristic algorithm for mapping communicating tasks on heterogeneous resources. In: Proceedings 9th Heterogeneous Computing Workshop (HCW 2000) (Cat. No.PR00556), pp. 102–115 (2000)
32. Tran, T.T., Do, M., Rieffel, E.G., Frank, J., Wang, Z., O'Gorman, B., Venturelli, D., Beck, J.C.: A hybrid quantum-classical approach to solving scheduling problems. In: Ninth Annual Symposium on Combinatorial Search (2016)
33. Wang, Z., Liu, W., Xu, J., Li, B., Iyer, R., Illikkal, R., Wu, X., Mow, W.H., Ye, W.: A case study on the communication and computation behaviors of real applications in NoC-based MPSoCs. In: 2014 IEEE Computer Society Annual Symposium on VLSI, pp. 480–485 (2014)
34. Wayne Bollinger, S., Midkiff, S.: Processor and link assignment in multicomputers using simulated annealing. In: ICPP, vol. 1, pp. 1–7 (1988)
35. Wille, R., Burgholzer, L., Zulehner, A.: Mapping Quantum Circuits to IBM QX Architectures Using the Minimal Number of SWAP and H Operations. arXiv e-prints arXiv:1907.02026 (2019)
36. Wootters, W.K., Zurek, W.H.: A single quantum cannot be cloned. Nature (5886), 802–803 (1982). DOI 10.1038/299802a0. URL http://dx.doi.org/10.1038/299802a0
37. Yamamoto, Y., Aihara, K., Leleu, T., Kawarabayashi, K.i., Kako, S., Fejer, M., Inoue, K., Takesue, H.: Coherent Ising machines - optical neural networks operating at the quantum limit. npj Quantum Information (1), 49 (2017). DOI 10.1038/s41534-017-0048-9. URL https://doi.org/10.1038/s41534-017-0048-9
38. Zick, K.M., Shehab, O., French, M.: Experimental quantum annealing: case study involving the graph isomorphism problem. Scientific Reports , 11168 (2015). URL http://dx.doi.org/10.1038/srep11168