Architecture aware compilation of quantum circuits via lazy synthesis

Abstract

Qubit routing is a key problem in quantum circuit compilation. It consists in rewriting a quantum circuit by adding the least possible number of instructions to make the circuit compliant with some architecture's connectivity constraints. Usually, this problem is tackled via either SWAP insertion techniques or re-synthesis of portions of the circuit using architecture aware synthesis algorithms. In this work, we propose a meta-heuristic that couples the iterative approach of SWAP insertion techniques with greedy architecture aware synthesis routines. We propose two concrete compilation algorithms based on this meta-heuristic and compare their performance to SWAP insertion techniques for several standard classes of quantum circuits. We also compare our approach with existing universal compilation techniques and show a significant reduction in the entangling gate overhead due to compilation.


Simon Martiel, Atos Quantum Lab., Les Clayes-sous-Bois, France
Timothée Goubault de Brugière, Laboratoire de Recherche en Informatique, Orsay, France
December 18, 2020


Compilation is a key step in any software stack. Programs are often specified using a high-level programming language that allows the programmer to describe the manipulation of the processor's memory using abstract structures. This high-level description is then refined, sometimes in several stages, until it can be fully expressed as a sequence of low-level instructions that can be executed by the processor. Quantum programming is no exception. In order to leverage the power of a quantum processor, one needs to compile high-level quantum programs into lower-level sequences of quantum instructions. This compilation step is particularly critical in the case of so-called NISQ processors [Pre18]. In these settings, the quantum instructions are prone to errors and the quantum memory undergoes decoherence phenomena, leading to quite large error rates. Consequently, there is a strong need for efficient heuristics to reduce the instruction count while still satisfying the architecture's constraints.

One of the most challenging problems in the field of compilation of quantum programs is the qubit routing problem. Most quantum processors come with a limited chip connectivity, only allowing a (usually) small number of couplings between the different qubits. The input circuit should therefore be altered in order to only make use of the available interactions. This problem is traditionally tackled via the insertion of additional SWAP gates inside the circuit in order to move logical qubits from one physical qubit to another [HNYN11, SSP13, LDX18, ZPW17, CSU19]. These techniques are inherently inefficient in the sense that they can only add gates to the compiled circuits and usually ignore the nature of the computation. Most of these algorithms lead to quite large SWAP/CNOT overheads when compared to the original circuit size.
These overheads can be detrimental to the success rate of the algorithm.

More recently, people started investigating the transverse approach of synthesizing quantum circuits that are readily compliant with a given connectivity. This approach is usually restricted to a particular subset of quantum circuits such as Boolean linear operators [KvdG19, dBBV+...] [...]

• Pick a subgroup of unitary operators that are easy to represent classically. By easy we mean that their classical representation has a polynomial size in the number of qubits and can be efficiently updated for composition.
• Initialize a data structure representing the identity.
• Iterate over the input circuit:
  – if the incoming gate belongs to the subgroup, update the current data structure with this gate,
  – if not, figure out a way to synthesize a piece of the current data structure into a circuit such that one can safely insert the incoming gate in the output circuit.

This paper is organized as follows. We start by formalizing the meta-algorithm succinctly described above using what we call the lazy synthesis framework. Section 3 shows how a standard SWAP insertion algorithm from [HNYN11] fits into this framework. We then extend this algorithm by using the group of linear Boolean reversible operators in section 4 and the Clifford group in section 5. Some benchmarks against standard classes of circuits are provided and discussed in section 6, together with a comparison with recent works in general purpose compilation. Finally, we propose some possible extensions, and conclude in a last section.

In this section we present a general formulation of the lazy synthesis meta-heuristic.

Notations:

• Circuits are words on a (potentially infinite) gate set. We use $::$ for concatenation, and $\varepsilon$ for the empty circuit.
• Given some gate $g$, we denote by $\tilde g$ its corresponding $n$-qubit unitary operator, and extend this notation to circuits. For instance, given a circuit $c = g_1 :: g_2$ as a word, the corresponding equation in $U(2^n)$ is $\tilde c = \tilde g_2 \cdot \tilde g_1$, where $\cdot$ stands for the standard linear operator composition.

To introduce our framework, we first need to introduce some simple conventions. We will assume that the input circuit is a sequence of gates taken from a set $G_{in}$, and that the output circuit should have gates in another gate set $G_{out}$. Here, we deliberately use a quite broad notion of gate set. For instance, $G_{out}$ could contain the exact same gates as $G_{in}$ but with additional constraints, such as connectivity constraints. We will also assume that we have access to some data structure $D = \langle H, \llbracket . \rrbracket, S, u, e \rangle$ representing a class of unitary operators, with the following constraints:

• $H$ is some set of classical descriptions. We will usually require these descriptions to be small (i.e. polynomial in the number of qubits and/or the number of input gates).
• $\llbracket . \rrbracket : H \to U(2^n)$ is an interpretation of the descriptions in $H$ as unitary operators.
• $S \subseteq G_{in}$ is a subset of the input gate set. Our data structure $D$ corresponds to the class of operators that can be implemented by circuits with gates from $S$.
• $u : H \times S \to H$ is an update function that verifies:
  $\llbracket u(h, g) \rrbracket = \tilde g \cdot \llbracket h \rrbracket$,
  that is, $u$ is sound with respect to $\llbracket . \rrbracket$. Less formally, $u$ updates $h$ into $u(h, g)$ by absorbing $g$ into $h$. We will usually require $u$ to update $h$ efficiently (i.e. to run in polynomial time w.r.t. the size of $h$).
• $e : H \times \bar S \to H \times G_{out}^*$, where $\bar S$ is the complement of $S$ in $G_{in}$. The function $e$ is an extraction function that verifies:
  $h', c = e(h, g) \implies \tilde g \cdot \llbracket h \rrbracket = \llbracket h' \rrbracket \cdot \tilde c$.
  Less formally, $e$ tells us how to commute $g$ with $h$ at the cost of updating $h$ into $h'$ and turning $g$ into a sub-circuit $c$. We will usually require $e$ to be efficient.

Equipped with such a data structure, we can describe our meta-heuristic as the simple recipe detailed in Algorithm 1.

Algorithm 1

Lazy synthesis meta-heuristic

procedure LazySynth(c_in)
    h ← Id
    c_out ← ε
    for g in c_in do
        if g ∈ S then
            h ← u(h, g)
        else
            h′, c = e(h, g)
            h ← h′
            c_out ← c_out :: c
        end if
    end for
    return h, c_out
end procedure

The main idea of the heuristic is to iteratively aggregate gates of $c_{in}$ in $h$ and $c_{out}$ while maintaining the invariant $\tilde c_{in}[1..i] = \llbracket h \rrbracket \cdot \tilde c_{out}$. That is: after compiling gate $i$, the initial segment $c_{in}[1..i]$ is equivalent to the composition of the current output circuit $c_{out}$ followed by the current stored operator $h$. It is easy to check the soundness of the algorithm using the expected properties of $u$ and $e$. The process of our meta-heuristic is illustrated in Fig. 1.

In other words, the gates in $S$ are the ones we want to avoid executing on the quantum processor. As they belong to a group of efficiently simulable operators, our goal is to keep track classically of their action on the memory as long as possible with the use of our update function $u$. When a gate $g$ not belonging to $S$ arises in the circuit, we try to minimize the quantity of extra gates needed to execute $g$ while keeping the functionality of the global operator. This is the goal of the extraction function $e$.

As one can notice, all the complexity of the heuristic lies in the implementation of the update and extraction functions $u$ and $e$. These functions will heavily rely on the underlying data structures.

[Figure 1: Illustration of Algorithm 1. The current circuit consists of the $G$-compatible output $c_{out}$, the stored operator $\llbracket T \rrbracket$ (not executed), the next gate $g$ and the remaining input $c_{in}$. If $g \in S$, the stored operator becomes $\llbracket u(T, g) \rrbracket$. If $g \notin S$, we compute $T', c_{corr} = e(T, g)$ and append the short, $G$-compatible circuit $c_{corr}$ to $c_{out}$. At any stage of the algorithm, we have the invariant $\tilde c_{in} \cdot \llbracket T \rrbracket \cdot \tilde c_{out}$, which is equal to the operator implemented by the input quantum circuit.]

In the next section, we show how to embed a SWAP insertion technique described in [HNYN11] into this framework. Later we will extend it to a broader set of operators to improve its performance.

In [HNYN11], the authors propose a heuristic to iteratively rewrite a quantum circuit by inserting SWAP gates to route logical qubits. In this approach, we will rely on the fact that elements in the permutation group $S_n$ can be efficiently represented and manipulated. In order to represent an element $\sigma \in S_n$, we can simply store an array of integers $[\sigma(1), ..., \sigma(n)]$. Moreover, given the representations of two permutations $\sigma$ and $\pi$, the representation of $\sigma \circ \pi$ is simple to compute.

Data structures.

We now describe how this algorithm is a particular case of our framework. We first need to define $G_{in}$, $G_{out}$, and $S \subseteq G_{in}$:

• $G_{in}$ contains any gate acting on at most 2 qubits,
• $G_{out}$ contains any gate acting on at most 2 qubits and compatible with some connectivity graph $G$,
• finally, $S = \{SWAP_{i,j},\ i, j \in [n]\}$ is the set of all possible qubit SWAPs.

The classical data structure simply describes a qubit permutation: $D = \langle S_n, \llbracket . \rrbracket, S, u, e \rangle$, where:

• $S_n$ denotes the permutation group over $n$ elements, where $n = |V(G)|$ is the number of qubits,
• $\llbracket . \rrbracket$ trivially associates to a permutation the corresponding $n$-qubit unitary operator,
• $u$ composes the current permutation with an incoming swap: $u(\sigma, SWAP_{i,j}) = (i, j) \circ \sigma$.

We now describe our extraction routine. Consider some gate $g$ in the input circuit. If $g$ is such that $\sigma^{-1}(g)$ is compatible with $G$, we can simply use the fact that
$g \cdot \llbracket \sigma \rrbracket = \llbracket \sigma \rrbracket \cdot \sigma^{-1}(g)$
to set $e(\sigma, g) = \sigma, \sigma^{-1}(g)$. However, if $\sigma^{-1}(g)$ is not compatible with $G$, we need to produce a $G$-compatible SWAP circuit $c_\pi$ implementing a permutation $\pi$ such that $\sigma'^{-1}(g)$ is compatible with $G$, with $\sigma' = \sigma \circ \pi^{-1}$. Then, we have that:
$g \cdot \llbracket \sigma \rrbracket = \llbracket \sigma \circ \pi^{-1} \rrbracket \cdot \sigma'^{-1}(g) \cdot \llbracket \pi \rrbracket = \llbracket \sigma \circ \pi^{-1} \rrbracket \cdot \sigma'^{-1}(g) \cdot \tilde c_\pi$
If we can produce such a circuit $c_\pi$, we can set $e(\sigma, g) = \sigma \circ \pi^{-1}, c_\pi :: \sigma'^{-1}(g)$. We now describe how such a SWAP circuit is produced in the algorithm of Hirata et al.

Since we need the gate $\sigma'^{-1}(g) = (\pi \circ \sigma^{-1})(g)$ to be compatible with $G$, $\pi$ can be seen as a permutation bringing the qubits of $\sigma^{-1}(g)$ close to one another in $G$. Let $a, b$ be the pair of qubits on which $g$ acts and let $p = (\sigma^{-1}(a) = p_1, ..., p_k = \sigma^{-1}(b))$ be the shortest path from $\sigma^{-1}(a)$ to $\sigma^{-1}(b)$ in $G$.
The algorithm enumerates $k - 1$ permutations, obtained by moving $\sigma^{-1}(a)$ toward $\sigma^{-1}(b)$ along $p$ and vice-versa until they meet somewhere along an edge of $p$. For each of these permutations, the algorithm is called recursively for the next $w$ entangling gates, and the permutation leading to the lowest SWAP overhead is picked and committed to the output circuit, thus producing $c_\pi$. Figure 2 gives an example of such a permutation enumeration. The general structure of such a recursive search is described in Appendix B. As expected, the performance of this algorithm heavily depends on the recursion depth parameter $w$.

The overall worst case complexity of this algorithm is $O(mn^w)$, with $m$ the number of entangling gates and $n$ the number of qubits, and neglecting the pre-computing of shortest paths.

[Figure 2: Example of permutation enumeration. The shortest path between the two qubits of the gate is shown as dashed edges in (a). Three different permutations are then explored, and the search recurses over the next $w$ entangling gates for some fixed parameter $w$.]

Now, using the lazy-synthesis framework to describe a SWAP insertion algorithm may seem a bit tedious and unnecessary. In this section, we show how, by extending our classical data structure, we can generalize the approach of Hirata et al. to outperform it in some settings.
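Before extending the data structure, the SWAP-insertion instance of Algorithm 1 can be summarized by the following Python sketch. It is only an illustration under stated assumptions: gates are hypothetical tuples `(name, a, b)`, the permutation is stored as a list with `sigma[i]` the image of wire `i`, and the extraction step greedily moves the first qubit along a shortest path, with no recursive lookahead. This is a simplification of, not a substitute for, the search of Hirata et al.

```python
from collections import deque

def shortest_path(adj, src, dst):
    """Breadth-first search in the connectivity graph; vertex list src..dst."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            break
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    path = [dst]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]

def update(sigma, gate):
    """u(sigma, SWAP_{i,j}) = (i j) o sigma: absorb an input SWAP classically."""
    _, i, j = gate
    tau = {i: j, j: i}
    return [tau.get(s, s) for s in sigma]

def extract(sigma, gate, adj):
    """e(sigma, g): emit G-compatible SWAPs c_pi, then g on the rerouted wires.

    Assumption: greedy routing, the first qubit is walked toward the second."""
    name, a, b = gate
    sigma = list(sigma)
    path = shortest_path(adj, sigma.index(a), sigma.index(b))
    circuit = []
    for p, q in zip(path[:-2], path[1:-1]):
        circuit.append(("SWAP", p, q))
        sigma[p], sigma[q] = sigma[q], sigma[p]  # sigma <- sigma o pi^{-1}
    circuit.append((name, sigma.index(a), sigma.index(b)))
    return sigma, circuit

def lazy_synth(c_in, n, adj):
    """Algorithm 1 with H = permutations and S = SWAP gates."""
    h, c_out = list(range(n)), []
    for gate in c_in:
        if gate[0] == "SWAP":
            h = update(h, gate)
        else:
            h, c = extract(h, gate, adj)
            c_out += c
    return h, c_out

# Line architecture 0 - 1 - 2 - 3 (hypothetical example).
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
h, c_out = lazy_synth([("CNOT", 0, 3), ("SWAP", 1, 2), ("CNOT", 1, 3)], 4, line)
```

The input SWAP never reaches the output circuit: it only changes the stored permutation `h`, which is returned alongside `c_out`.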

We consider the set of reversible circuits over $n$ qubits comprising only CNOT gates. This set generates the entire set of reversible linear Boolean operators over $n$ variables, and in particular contains the set of all $n$-element permutations. This set has a lot of nice properties: it is easy to represent its elements via $n \times n$ Boolean tables, each row representing an output parity of the circuit [AAM18]. More precisely, given a linear reversible operator $A \in \mathbb{F}_2^{n \times n}$ acting on $n$ qubits at initial values $x = (x_0, x_1, ..., x_{n-1})$, $x_i \in \{0, 1\}$, the logical value of the $i$-th qubit after execution of $A$ is given by
$\alpha_0 x_0 \oplus \alpha_1 x_1 \oplus ... \oplus \alpha_{n-1} x_{n-1}$
where $\alpha = A[i, :]$ is the $i$-th row of $A$ and $\oplus$ stands for the XOR operation. Therefore we can keep track of the action of CNOT circuits on the quantum memory with a polynomially-sized structure.

Moreover, it is simple to update such tables via row (resp. column) operations to accommodate left (resp. right) composition of the operator by a CNOT [PMH08]. More generally, given an initial table $A$ and a linear reversible circuit implementing a table $B$, the updated table is given by $BA$.

Lazy linear synthesis.

Our gate sets are defined as follows:

• $G_{in}$ contains any 1-qubit gate and CNOT gates on arbitrary pairs of qubits, thus also including SWAPs,
• $G_{out}$ contains any 1-qubit gate and CNOT gates compatible with some connectivity graph $G$,
• finally, $S = \{CNOT_{i,j} \mid i, j \in V(G)\}$ is the set of CNOT gates.

The classical data structure describes reversible linear Boolean operators over $n = |V(G)|$ qubits:

• $H$ is the set of invertible $n$ by $n$ Boolean matrices,
• $\llbracket . \rrbracket$ trivially associates to a linear operator the corresponding $n$-qubit unitary operator,
• $u$ updates a table as expected with a matrix/matrix product: $u(A, CNOT_{i,j}) = E_{i,j} \cdot A$, where $E_{j,i}$ is the table representation of the operator $CNOT_{j,i}$, given by the identity matrix with one additional 1 at row $i$, column $j$. In practice, given the simple structure of the $E_{i,j}$ operators, we recover the property that the action of a left-composition by a CNOT operator on $A$ is equivalent to a row operation on the table $A$.

Given some incoming 1-qubit gate $g$ acting on qubit $q$ and some linear operator $A$, the behavior of our extraction routine relies on the following two properties:

1. If $A$ has shape
$$A = \begin{pmatrix} B' & 0 & B'' \\ 0 \cdots 0 & 1 & 0 \cdots 0 \\ B''' & 0 & B'''' \end{pmatrix} \qquad (1)$$
where the middle row and the middle column are the $q$-th row and the $q$-th column, then $A$ acts as the identity on qubit $q$. Consequently, any 1-qubit gate acting on qubit $q$ commutes with $A$.
2. For any invertible $B \in \mathbb{F}_2^{n \times n}$, we have the relation
$\llbracket A \rrbracket = \llbracket A B B^{-1} \rrbracket = \llbracket AB \rrbracket \cdot \llbracket B^{-1} \rrbracket$.
This means that if we add a linear reversible circuit implementing $B^{-1}$ to our current circuit, then, to preserve the functionality of our quantum circuit, the classical representation of the qubits is updated to $AB$.

One can always find an operator $B$ such that $AB$ has the shape given by Eq. (1). Given such an operator $B$, we have
$\tilde g \cdot \llbracket A \rrbracket = \tilde g \cdot \llbracket AB \rrbracket \cdot \llbracket B^{-1} \rrbracket$ (property 2) $= \llbracket AB \rrbracket \cdot \tilde g \cdot \llbracket B^{-1} \rrbracket$ (property 1).
Hence, we define our extraction function $e$ as:
$e(A, g) = (AB, c :: g)$
where $B$ is such that $AB$ satisfies Eq. (1) and $c$ is a $G$-compatible circuit implementing $B^{-1}$.

In fact, we can slightly relax the structure of $AB$ and apply $g$ on a qubit different from qubit $q$. Indeed, considering another qubit $q' \neq q$ and writing $S_{q,q'}$ the Boolean linear operator associated to the swapping operator of qubits $q$ and $q'$, we have
$$\tilde g \cdot \llbracket A \rrbracket = \llbracket AB \rrbracket \cdot \tilde g \cdot \llbracket B^{-1} \rrbracket = \llbracket AB \rrbracket \cdot \tilde g \cdot \llbracket S_{q,q'} \rrbracket \cdot \llbracket S_{q,q'} \rrbracket \cdot \llbracket B^{-1} \rrbracket = \llbracket A (B S_{q,q'}) \rrbracket \cdot \tilde g' \cdot \llbracket (B S_{q,q'})^{-1} \rrbracket \qquad (2)$$
where $g'$ is the gate $g$ executed on qubit $q'$. In other words, as long as $A$ has shape (1) up to a permutation of the columns, one can still apply gate $g$ on the qubit $q'$ for which $A[:, q'] = e_q$. Our goal now is to find a suitable operator $B$ such that $c$ is the smallest possible. We provide a heuristic to construct such a circuit.

4.2 Partial synthesis routine

In order to simplify the description of our heuristic, we first remark that the shape (1) that we would like to achieve is stable under inversion. That is, finding $B$ such that $AB$ has shape (1) is equivalent to finding $B^{-1}$ such that $B^{-1} A^{-1}$ has shape (1). So instead of working on the columns of $A$ we can work on the rows of $A^{-1}$ and directly compute a quantum circuit for $B^{-1}$. Notably, due to Eq. (2), the freedom we have in the choice of the column for reducing $A$ to shape (1) is now a freedom in the choice of the row of $A^{-1}$.

Given some incoming 1-qubit gate acting on qubit $q$, our heuristic works in two stages:

• We start by setting one row of $A^{-1}$ to $e_q^T$. By definition of the inverse, the $q$-th row of $A$ produces a bit vector describing which wires of the circuit should be folded using a fan-in CNOT (i.e. a cascade of CNOT gates that share the same target) onto one of them in order to produce $e_q$ on $A^{-1}$. By Eq. (2) we can choose any of the wires $q'$ for which $A[q, q'] = 1$.
• After choosing a suitable qubit $q'$ and updating $A^{-1}$ accordingly, the $q$-th column of the operator can be zeroed by distributing the $q'$-th row onto every row containing a nonzero $q$-th component. This can be achieved using a single fan-out CNOT (i.e. a cascade of CNOT gates sharing the same control).

Hence, we simply need to be able to produce implementations of fan-in and fan-out CNOT gates that are compliant with our connectivity graph.

To perform this synthesis we use a relaxed version of the method described in [KvdG19]. The idea is the following:

• compute $y = e_q^T \cdot A$, $y = \{y_1, ..., y_k\}$,
• compute a Steiner tree of the connectivity graph $G$, with terminal nodes $\{y_1, ..., y_k\}$,
• pick a terminal node $y_i$ and perform algorithm 2. This routine is a straightforward generalization of the nearest-neighbor implementation of a CNOT gate proposed in [KMS07] (cf. their Figure 1) that is relaxed to leave intermediate wires in arbitrary states. It acts by pruning leaves of the tree while preserving the invariant that the leaves of the tree must be considered as control qubits for the rest of the fan-in synthesis. All CNOT gates used in the circuit are compliant with the tree's connectivity, making the circuit compliant with the qubits' connectivity. Figure 3 gives an example of execution of this routine.

Algorithm 2

Fan-in along a tree

procedure FanIn(T, y, root)
    c_out ← ε
    while |T| > 1 do
        v ← a leaf of T that is not root
        u ← the only neighbor of v
        if u ∉ y then
            c_out ← c_out :: CNOT(u, v)
        end if
        c_out ← c_out :: CNOT(v, u)
        T.remove(v)
    end while
    return c_out
end procedure

Notice that intermediate wires may be left in a different state. Our only goal is to produce the correct parity $e_q$ on the root wire, and we take the liberty of freely changing the state of the intermediate wires. The resulting circuit contains $2(l - 1) - k$ CNOTs, where $l$ is the size of the tree and $k$ is the number of terminal vertices (i.e. the Hamming weight of $y$), including the root of the tree.

Fan-outs are synthesized in a similar fashion, except that terminal vertices are found by looking at the lines of the updated operator $A'$ that have a non-zero $q$-th component, and algorithm 3 is used to produce a circuit.

Algorithm 3

Fan-out along a tree

procedure FanOut(T, y, root)
    c_out ← ε
    Ones ← y
    T′ ← a copy of T
    while |T′| > 1 do                ▷ Setting all the vertices of T to 1
        v ← a leaf of T′
        u ← the only neighbor of v
        if u ∉ Ones then
            c_out ← c_out :: CNOT(v, u)
            Ones.insert(u)
        end if
        T′.remove(v)
    end while
    while |T| > 1 do                 ▷ Getting rid of all 1s (except for root)
        v ← a leaf of T that is not root
        u ← the only neighbor of v
        c_out ← c_out :: CNOT(u, v)
        T.remove(v)
    end while
    return c_out
end procedure

This algorithm corresponds exactly to the fill-tree/empty-tree routine of [KvdG19], except that we work on the full hardware graph, and never have to restrict the structure of the Steiner tree to a "descending" tree. This approach only works because we heavily rely on the fact that we are synthesizing a single row/column and thus allow ourselves to leave intermediate wires in arbitrary states.

Both of these routines are quite close to the ones used in [KvdG19], except that we allow ourselves to be sloppier in the process, leaving any intermediate qubit in a dirty state instead of having to preserve invariants when implementing the fan-ins/fan-outs.
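To make the "dirty" fan-in more concrete, here is a minimal Python sketch of the leaf-pruning idea of algorithm 2, checked with a parity table updated by row operations as in the update function $u$. Several things are assumptions of this sketch rather than the authors' implementation: the tree is a hypothetical dict mapping each vertex to its set of neighbours, every leaf is assumed to be a terminal, and a Steiner vertex is explicitly marked as carrying a parity once a leaf has been merged into it (bookkeeping that matters when a Steiner vertex has several children).

```python
def fan_in(tree, terminals, root):
    """Fold the parity of all terminal wires onto `root` along a tree.

    Returns a list of CNOTs as (control, target) pairs, each on a tree edge."""
    adj = {v: set(nb) for v, nb in tree.items()}  # work on a copy
    holds_parity = set(terminals)                 # wires carrying a needed parity
    circuit = []
    while len(adj) > 1:
        # Pick any leaf other than the root and its unique neighbour.
        v = next(x for x in adj if x != root and len(adj[x]) == 1)
        (u,) = adj[v]
        if u not in holds_parity:
            # u is an untouched Steiner vertex: this extra CNOT makes the
            # next one overwrite u with the parity of v (v is left dirty).
            circuit.append((u, v))
            holds_parity.add(u)  # assumption of this sketch, see lead-in
        circuit.append((v, u))   # add the parity of v onto u
        adj[u].discard(v)
        del adj[v]
    return circuit

def apply_cnots(circuit, wires):
    """Parity table as one set of input indices per wire (a row of the table).

    A CNOT is a row operation: the control row is XORed into the target row."""
    rows = {w: {w} for w in wires}
    for control, target in circuit:
        rows[target] = rows[target] ^ rows[control]
    return rows

# Hypothetical tree: terminals 0, 1 and 4 (the root), Steiner vertices 2 and 3.
tree = {0: {2}, 1: {2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
circuit = fan_in(tree, {0, 1, 4}, root=4)
rows = apply_cnots(circuit, tree)
```

After the run, `rows[4]` holds the parity of the three terminals while wires 0 to 3 are left in whatever state the pruning produced, which is exactly the relaxation discussed above.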

In practice we improve the algorithm using two independent optimizations.

Dealing with phase gates.

It is unnecessary to zero a column of our current linear operator if we just need to insert a phase gate (i.e. a diagonal gate). Indeed, since the gate is diagonal, and assuming it is executed on qubit $q$, it is well known that the gate commutes with any CNOT whose target is not $q$. So the diagonal gate will commute with the subsequent fan-out, because one can check that the CNOT gates of the fan-out only use the qubit on which the diagonal gate is executed as a control. Hence, this fan-out can be omitted, thus approximately halving the number of required CNOT gates.

Recursive search of finite depth.

As mentioned at the end of section 4.1, we can synthesize our operator $B$ up to some column permutation. This gives us some freedom to perform some optimizations when picking the qubit that will effectively receive the incoming 1-qubit gate. To leverage this freedom, we can adopt the same strategy as in the SWAP insertion algorithm of Hirata et al. In practice, given an incoming gate $g$ acting on qubit $q$, we:

• compute the set $y$ of rows of $A^{-1}$ that need to interact in the fan-in CNOT,
• generate a Steiner tree with terminal vertices $y$,
• branch over all choices of $y_i \in y$ to receive gate $g$.

Notice that this boils down to trying all possible terminal vertices as root vertices in algorithm 2. We then perform a recursive search as described in Appendix B.

[Figure 3: Example of a tree and the corresponding fan-in CNOT circuit generated by algorithm 2. The terminal vertices are circled. Intermediate vertices are represented as •. Notice that this routine can be improved in order to reduce the depth of the fan-in gate. In this work we decided to focus on CNOT count and thus did not insist on these lower level optimizations.]

Overall, including a recursive search of depth $w$, the worst case time complexity of our algorithm grows as $O(mn^w)$, where $m$ is the number of 1-qubit gates and $n$ the number of qubits in the target architecture. Notice that the runtime is linear in the input circuit's size, but grows exponentially in the depth of the recursive search.

Dealing with the final operator.

In the general case, the final linear operator in our classical data structure is not trivial. In a general compilation setting, this is not much of an issue, for two reasons:

• in the setting where we might have a follow-up circuit to compile, one can initialize the linear operator for the next compilation round to the final operator of the previous round,
• if we just finished compiling the final portion of our full quantum algorithm, one can always fix the sampled data in order to classically emulate the final linear operator. This operation boils down to inverting a simple linear system over $\mathbb{F}_2$.

Moreover, in most NISQ applications, the sampling directives executed at the end of a quantum circuit are there to estimate the expected value of some Hermitian operator $H$. Most of the time, this operator is specified in the Pauli basis. Thus, it is enough to compute a new Hermitian operator $A^{-1} H A$ such that sampling this operator at the end of the compiled circuit is equivalent to sampling the original operator at the end of the input circuit, and this new operator has the same number of terms as the original operator:
$\langle 0 | C_{in}^\dagger H C_{in} | 0 \rangle = \langle 0 | C_{out}^\dagger \left( A^{-1} H A \right) C_{out} | 0 \rangle$
In fact, this property is true for a larger subgroup: the Clifford group, which is tackled in the following section. The fixing procedures for the sampling and observable cases are detailed in Appendix A in the more general case of Clifford operators.

Generalization to routing via lazy synthesis of Clifford operators

We now further extend the previous approaches to lazy synthesis of elements of the Clifford group.

The Clifford group, $\mathcal{C}_n$, is a natural extension of the class of reversible linear Boolean operators. This group is defined as the largest subgroup of the unitary group that stabilizes the group of Pauli operators $\mathcal{P}_n$:
$$\mathcal{C}_n = \{ U \in U(2^n),\ \forall P \in \mathcal{P}_n,\ U^\dagger P U \in \mathcal{P}_n \} \qquad (3)$$
Given a Pauli operator $P \in \mathcal{P}_n$ and a real angle $\theta \in \mathbb{R}$, we define the Pauli rotation $R_P(\theta)$ as:
$R_P(\theta) = \cos(\theta/2) I - i \sin(\theta/2) P$
The conjugation property (3) also applies to Pauli rotations, and not only to Pauli operators. Hence, for any Pauli rotation of axis $P \in \mathcal{P}_n$, any angle $\theta \in \mathbb{R}$, and any $U \in \mathcal{C}_n$:
$U^\dagger R_P(\theta) U = R_{U^\dagger P U}(\theta) = R_{P'}(s \cdot \theta)$
for some Pauli operator $P'$ and some sign $s = \pm 1$, that is,
$R_P(\theta) U = U R_{P'}(s \cdot \theta)$.
In fact, this relation can be used to normalize quantum circuits as sequences of non-Clifford Pauli rotations (i.e. Pauli rotations with angles $\neq k\pi/2$) [...] tableaux that specify how they act by conjugation over generators of the Pauli group [AG04, dB11]. In practice, this means that we can implement a data structure $T$ (a tableau), representing a Clifford operator in $\mathcal{C}_n$, that:

• can be easily updated as $T \leftarrow \tilde g \cdot T$ or $T \leftarrow T \cdot \tilde g$ for some Clifford gate $g$,
• can be used to efficiently compute $P \mapsto T P T^\dagger$ for some $n$-qubit Pauli operator $P$, yielding another Pauli operator (and potentially a phase in $\pm 1$).

In the following, we define the support of a Pauli operator $P$ as the set of qubits on which $P$ acts non-trivially. E.g. if $P = I \otimes Z \otimes X \otimes I$, then the support of $P$ is the set $\{1, 2\}$, since $P$ acts as the identity on qubits 0 and 3. For ease of notation we will drop the $\otimes$ operators.

Remark. In the following subsection, we will use the following simple structure to implement a Pauli rotation $R_P(\theta)$. We can first reduce $P$ to a diagonal operator by conjugating it through a circuit composed of local Clifford gates. This circuit can be built by individually diagonalizing each component of the Pauli operator:

• if the operator acts as $X$ on qubit $i$, insert an $H$ gate on qubit $i$,
• if it acts as $Y$ on qubit $i$, insert a $\sqrt{X} = R_X(\pi/2)$ on qubit $i$.

The resulting Pauli operator acts either as $Z$ or $I$ on each qubit. Using the identity $CNOT \cdot ZZ \cdot CNOT = IZ$, one can reduce the support of $P$ to a single qubit via conjugation by a circuit composed of $|P| - 1$ CNOT gates targeting some qubit $q$. In the end, the resulting Clifford circuit $C$ verifies $R_P(\theta) = C^\dagger R_{Z_q}(\theta) C$. An example is given in Figure 4. This reduction can be easily extended to take the architecture into account by performing a fan-in CNOT along a Steiner tree with the support of the rotation as terminal vertices.

[Figure 4: Reduction of a Pauli operator/rotation. (a) The initial Pauli operator $XYZI$. (b) After conjugation via local Cliffords ($H$ and $\sqrt{X}$), our operator is diagonal ($ZZZI$). (c) After conjugation with the appropriate CNOT gates, our operator is localized on a single qubit (here, the first qubit). (d) The final quantum circuit implementing $R_{XYZI}(\theta)$.]

In that setting we will consider that $G_{in}$ contains only Clifford gates and arbitrary Pauli rotations $R_P$ for $P \in \mathcal{P}_n$. $G_{out}$ will contain $CNOT$, $H$, $R_X(\pi/2)$ gates and arbitrary $R_Z$ rotations, the CNOTs being restricted to some interaction graph $G$.

In order to use our meta-heuristic, we need to specify our full data structure $D = \langle \mathcal{T}, \llbracket . \rrbracket, S, u, e \rangle$:

• $\mathcal{T}$ is the set of Clifford operators, or, to be precise, of tableaux representing Clifford operators,
• $\llbracket . \rrbracket$ is the standard tableau interpretation,
• $S$ is the set of Clifford gates,
• $u$ is the update of a tableau using a Clifford gate by left composition: $u(T, g) = \tilde g \cdot T$.

Our extraction function $e$ acts as follows. Upon encountering a non-Clifford Pauli rotation $R_P(\theta)$:

(i) Compute a Pauli operator $P'$ and a phase $s = \pm 1$ such that $s \cdot P' = T^\dagger P T$.
(ii) For each qubit $i$ in the support of $P'$, if $P'[i] = Y$ then perform an $R_X(\pi/2)$ gate on $i$, and if $P'[i] = X$, perform an $H$ on $i$. This produces a Clifford circuit $c$, comprising only local gates. E.g. for $P' = IXYZI$, we produce a circuit $c = H_1 :: R_X(\pi/2)_2$.
(iii) Pick a target qubit $q$ in the support of $P'$, and perform algorithm 2 in order to generate a fan-in CNOT from all qubits in the support of $P'$ to $q$, thus updating the Clifford sub-circuit $c$.
(iv) Update $T$ by right composition with $\tilde c^\dagger$: $T' \leftarrow T \cdot \tilde c^\dagger$.
(v) Return the updated table $T'$ and the sub-circuit $c :: R_Z(s \cdot \theta)_q$.

The following proposition about $e$ holds:

Proposition 1.

Let $T$ be a tableau and $R_P(\theta)$ be a Pauli rotation. If $T', c = e(T, R_P(\theta))$, then $R_P(\theta) \cdot \llbracket T \rrbracket = \llbracket T' \rrbracket \cdot \tilde c$.

Proof.

By construction, we have that
$c = c_{prep} :: R_Z(s \cdot \theta)_q$
with $c_{prep}$ and $q$ such that
$c_{prep} :: R_Z(s \cdot \theta)_q :: c_{prep}^\dagger = R_{P'}(s \cdot \theta)$
where $s \cdot P' = T^\dagger P T$, and $c_{prep}$ is a Clifford circuit. This implies that $\tilde c = \tilde c_{prep} \cdot R_{P'}(s \cdot \theta)$.

To be precise, $c_{prep}$ holds the local basis changes and CNOT cascade necessary to the implementation of $R_{P'}(s \cdot \theta)$, plus some stray Clifford operators that might have appeared during the "dirty" fan-in (corresponding to the dashed box in the example circuit below).

[Circuit diagram: $c_{prep}$, then $R_Z(s \cdot \theta)$, then $c_{prep}^\dagger$, followed by $\llbracket T \rrbracket$.]

Hence, we have:
$$\begin{aligned}
R_P(\theta) \cdot \llbracket T \rrbracket &= \llbracket T \rrbracket \cdot \llbracket T \rrbracket^\dagger \cdot R_P(\theta) \cdot \llbracket T \rrbracket \\
&= \llbracket T \rrbracket \cdot \left( \llbracket T \rrbracket^\dagger \cdot R_P(\theta) \cdot \llbracket T \rrbracket \right) \\
&= \llbracket T \rrbracket \cdot R_{T^\dagger P T}(\theta) \\
&= \llbracket T \rrbracket \cdot R_{P'}(s \cdot \theta) \quad \text{where } s \cdot P' = T^\dagger P T \\
&= \llbracket T \rrbracket \cdot \tilde c_{prep}^\dagger \cdot R_{Z_q}(s \cdot \theta) \cdot \tilde c_{prep} \\
&= \llbracket T \cdot \tilde c_{prep}^\dagger \rrbracket \cdot \tilde c \\
&= \llbracket T' \rrbracket \cdot \tilde c
\end{aligned}$$

In the end, our final output circuit will always have shape:
$C_{out} = C \prod_i R_{Z_{q_i}}(\theta_i) F_i L_i$
where $C$ is some Clifford operator, $R_{Z_{q_i}}(\theta_i)$ are non-Clifford local $Z$ rotations, $F_i$ are architecture compliant fan-in CNOTs as described by algorithm 2, $L_i$ are local Clifford circuits, and $q_i$ are the target qubits used in the Pauli rotation reductions.

Recursive search of finite depth.

Notice that, once again, we have some freedom of choice when picking the qubit that will receive the $R_Z$ rotation. In practice, we perform a recursive search of finite depth for the next $w$ rotations to synthesize and pick the host qubit that leads to the least overhead. The branching is very similar to the one described in 4.3. After computing the Steiner tree whose terminal vertices are the support of the rotation we are currently synthesizing, one can choose any terminal vertex to be the target of our fan-in. Once again we refer to Appendix B for more details. The overall worst case complexity is the same as in the CNOT case. Indeed, the complexity is dominated by the recursive exploration of a search tree where each vertex exploration requires the generation of a Steiner tree of the architecture graph.

Dealing with the final Clifford operator.

Once again, we are left with a possibly non-trivial final Clifford operator $C$. As stated in the previous section, if one has to compile several pieces of circuits in sequence, one can always initialize the Clifford operator of the next compilation round using $C$. In the general case where we are done compiling and need to effectively deal with this operator, we can almost always avoid having to synthesize the full operator $C$. Section A describes how to do so when sampling an observable or sampling bit-strings in the computational basis.

Rotation merging.

As mentioned in subsection 5.1, any quantum circuit can be reformulated as a sequence of Pauli rotations with non-Clifford angles (i.e., angles ≠ k π). That is:

$$C \prod_i R_{P_i}(\theta_i)$$

where $R_{P_i}(\theta_i)$ are Pauli rotations and $C$ is a final Clifford operator. Moreover, this form can be efficiently computed by pulling all the Clifford gates to the end of the circuit. Once such a product is obtained, one can try to merge rotations with identical axes. This can also be done efficiently by considering each rotation one by one and checking whether it can be commuted and merged with a rotation with an identical axis. This routine is described in Algorithm 4. Notice that this is not the only way to produce a final ordering of the rotations. In particular, when we insert the un-merged rotation in list $L$ (line 14), one could make a different choice and insert it sooner in the list. By inserting it at the end of the list, we might block some other merges by preventing the next rotations from commuting past it. In order to keep this optimization lightweight and reproducible, we keep things simple and insert the rotation at the end of the list.

This optimization has several consequences. First, by merging rotations, we reduce the number of calls to the partial synthesis routine. Moreover, by merging rotations, one might end up with a rotation with a Clifford angle. Such a rotation can then be pulled to the end of the circuit, effectively removing it from the sequence of rotations to synthesize. This optimization is a key feature when dealing with Clifford + T circuits, where this type of situation occurs regularly. This pre-processing has a worst case time complexity of O(m n), where m is the number of non-Clifford Pauli rotations and n is the number of qubits.

Algorithm 4 Rotation merging

    procedure MergeRotations(S)
        L ← [ ]
        for R_P(θ) in S do
            for R_P′(θ′) in reversed(L) do
                if P = P′ then
                    θ′ ← θ′ + θ
                    break
                end if
                if P and P′ do not commute then
                    break
                end if
            end for
            if P was not inserted then
                L ← L :: R_P(θ)
            end if
        end for
        return L
    end procedure

Rotation reordering.

Another optimization that can easily be computed is the reordering of consecutive commuting rotations. Given a sequence of Pauli rotations $\prod_i R_{P_i}(\theta_i)$, one can rewrite it as $\prod_G \prod_{i \in G} R_{P_i}(\theta_i)$ where the $G$ are groups of commuting rotations. Notice that in this expression, while the first product is ordered, the second is not. This gives us some leverage for optimization. In practice, we use a greedy approach consisting in synthesizing the least costly rotation first. That is, we compute all the Steiner trees necessary to implement all the rotations in a given group and start with the rotation that requires the smallest tree. Groups of commuting rotations are computed greedily using Algorithm 5. Notice that this is not the only way to produce such a sequence. In practice, trying harder to form larger groups of commuting rotations did not seem to improve the benchmark results, hence the rather simple greedy heuristic. This pre-processing has a worst case time complexity of O(m n), where m is the number of non-Clifford Pauli rotations and n is the number of qubits.

Algorithm 5 Rotation grouping

    procedure GroupRotations(S)
        L ← [ ]
        G ← {}
        for R_P(θ) in S do
            if R_P commutes with all rotations in G then
                G.insert(R_P(θ))
                continue
            end if
            L ← L :: G
            G ← { R_P(θ) }
        end for
        L ← L :: G
        return L
    end procedure

Benchmarks

In order to benchmark our method we picked three representative architectures: Rigetti's Aspen chip (16 qubits), IBM's Melbourne chip (14 qubits), and a fictitious all-to-all architecture (14 qubits). Melbourne's connectivity is close to a grid, whereas Aspen's connectivity contains longer cycles and has a less regular structure. The all-to-all architecture acts as a baseline in the benchmarks. The connectivity graphs are described in Figure 5.

Figure 5: (a) IBM's Melbourne and (b) Rigetti's Aspen connectivity graph

We benchmarked three algorithms:
• Hirata et al. SWAP insertion algorithm (generalized to arbitrary connectivity, search depth of 4), denoted swap in the various benchmarks,
• lazy synthesis using linear boolean operators (depth of 3), denoted linear in the benchmarks,
• lazy synthesis using Clifford operators (depth of 3), denoted clifford in the benchmarks,
• lazy synthesis using Clifford operators (depth of 3) with the additional reordering of Pauli rotations, denoted clifford⋆ in the benchmarks,
• lazy synthesis using Clifford operators (depth of 3) with the additional merging of Pauli rotations, denoted clifford† in the benchmarks,
• lazy synthesis using Clifford operators (depth of 3) with the additional merging and reordering of Pauli rotations, denoted clifford⋆† in the benchmarks,

on four sets of quantum circuits:
• A set of random circuits parameterized by their Clifford gate density (see figure 6). This parameterization helps predict the performance of our methods when applied to other families of circuits. See below for a description of the random generation process.
• A collection of standard circuits taken from [AAM18] that fit on 14 qubits. Circuits are simply pre-processed by replacing Toffoli gates by a standard CNOT + T decomposition. Tables 1, 2, 3 provide the final CNOT counts and the relative CNOT overhead for the three hardware models.
• A set of random QAOA instances of MAX-k-LIN-2 (depth 1). These circuits are basically phase polynomials with uniform Hamming weights equal to k. The circuits are generated using a naive strategy and produce a large amount of CNOTs. Their Clifford density roughly grows as k − k − (neglecting the final layer of non-Clifford X rotations and the initial Walsh-Hadamard transform).
• A set of random products of arbitrary Pauli rotations. These present roughly the same statistical features as standard quantum chemistry/material Ansätze. These circuits usually exhibit quite large Clifford densities ( > .

Why no other SWAP insertion algorithms?

We also tried to include other SWAP insertion algorithms (namely SABRE [LDX18], and an A∗ based approach [ZPW17]), but both of these methods performed systematically worse than the Hirata et al. approach generalized to arbitrary connectivity (the algorithm described in section 3). Moreover, the execution time of the A∗ approach can sometimes become prohibitive, which makes it impractical for realistic applications.

Random generation process.

Our random circuit generation process is parameterized by a number of qubits n and a Clifford density parameter p. Each circuit contains n gates. For each gate, with probability 1 − p, we insert a non-Clifford Z rotation on a random qubit. Else, with probability , we insert a random CNOT else a random 1-qubit Clifford gate. These circuits have roughly the same number of CNOTs and 1-qubit Cliffords as naively implemented VQE/QAOA Ansätze and are therefore representative of typical variational quantum circuits.

Figure 6: Benchmarks for random circuits. Circuits are randomly generated using all the n qubits of the target architecture, n gates and a fixed Clifford density (see section 6). Each point is generated by compiling 100 random circuits.

Random circuits.

The simplest set of benchmarks to explain is the one over random circuits. When increasing the Clifford density, the average number of entangling gates grows, leading to a growing linear overhead for the SWAP insertion approach. The approach based on linear boolean operators eventually outperforms the SWAP insertion technique when the Clifford density becomes large, due to the increased proportion of CNOT gates. This approach still requires a synthesis every time a non-CNOT gate is met, hence the large overhead. As expected, the Clifford based approach quickly outperforms both other approaches when Clifford gates become predominant in the circuit, since it has fewer and fewer operators to synthesize. Notice how this approach ultimately achieves a compression (i.e., negative CNOT overhead). The qualitative behavior is comparable for Melbourne and Aspen connectivities. For the all-to-all connectivity, the SWAP insertion is trivial, hence omitted. Interestingly, the Clifford approach is significantly worse than doing nothing in that case, up until the Clifford density is larger than about 85%, in which case it starts compressing the circuit. Our rotation merging pre-optimization strictly outperforms the SWAP insertion approach in the two constrained architectures (see clifford† and clifford⋆†).

Table 1: Compilation of a collection of standard circuits for Melbourne architecture.
| circuit | init | swap | linear | clifford | clifford⋆ | clifford† | clifford⋆† |
|---|---|---|---|---|---|---|---|
| tof_3 | 18 | 116.7% 39 | 72.2% 31 | 50.0% 27 | 61.1% 29 | 0.0% 18 | 11.1% 20 |
| barenco_tof_3 | 24 | 75.0% 42 | 25.0% 30 | 4.2% 25 | 8.3% 26 | -41.7% 14 | -41.7% 14 |
| mod5_4 | 28 | 117.9% 61 | 35.7% 38 | 7.1% 30 | -28.6% 20 | -25.0% 21 | -35.7% 18 |
| tof_4 | 30 | 110.0% 63 | 76.7% 53 | 33.3% 40 | 56.7% 47 | -3.3% 29 | 0.0% 30 |
| tof_5 | 42 | 135.7% 99 | 200.0% 126 | 176.2% 116 | 83.3% 77 | 119.0% 92 | 64.3% 69 |
| qft_4 | 46 | 176.1% 127 | 45.7% 67 | 26.1% 58 | 4.3% 48 | -30.4% 32 | -32.6% 31 |
| barenco_tof_4 | 48 | 112.5% 102 | 131.2% 111 | 29.2% 62 | 70.8% 82 | -31.2% 33 | -20.8% 38 |
| mod_mult_55 | 48 | 337.5% 210 | 341.7% 212 | 202.1% 145 | 141.7% 116 | 145.8% 118 | 131.2% 111 |
| vbe_adder_3 | 70 | 107.1% 145 | 30.0% 91 | -11.4% 62 | -32.9% 47 | -55.7% 31 | -52.9% 33 |
| barenco_tof_5 | 72 | 112.5% 153 | 123.6% 161 | 134.7% 169 | 61.1% 116 | 6.9% 77 | 5.6% 76 |
| rc_adder_6 | 93 | 167.7% 249 | 48.4% 138 | 45.2% 135 | 30.1% 121 | -14.0% 80 | -4.3% 89 |
| gf2^4_mult | 99 | 209.1% 306 | 263.6% 360 | 202.0% 299 | 63.6% 162 | 96.0% 194 | 22.2% 121 |
| mod_red_21 | 105 | 185.7% 300 | 150.5% 263 | 112.4% 223 | 81.0% 190 | 49.5% 157 | 24.8% 131 |
| hwb6 | 116 | 196.6% 344 | 144.8% 284 | 62.1% 188 | 61.2% 187 | 23.3% 143 | 18.1% 137 |
| grover_5 | 288 | 116.7% 624 | 158.7% 745 | 207.3% 885 | 174.7% 791 | 33.3% 384 | 6.6% 307 |
| hwb8 | 7129 | 227.0% 23311 | 126.5% 16144 | 154.0% 18110 | 127.9% 16246 | 77.5% 12654 | 76.4% 12574 |

Table 2: Compilation of a collection of standard circuits for Aspen architecture.

| circuit | init | swap | linear | clifford | clifford⋆ | clifford† | clifford⋆† |
|---|---|---|---|---|---|---|---|
| tof_3 | 18 | 116.7% 39 | 72.2% 31 | 50.0% 27 | 61.1% 29 | 0.0% 18 | 11.1% 20 |
| barenco_tof_3 | 24 | 75.0% 42 | 25.0% 30 | 4.2% 25 | 8.3% 26 | -41.7% 14 | -41.7% 14 |
| mod5_4 | 28 | 117.9% 61 | 35.7% 38 | 7.1% 30 | -28.6% 20 | -25.0% 21 | -35.7% 18 |
| tof_4 | 30 | 110.0% 63 | 76.7% 53 | 33.3% 40 | 56.7% 47 | -3.3% 29 | 0.0% 30 |
| tof_5 | 42 | 171.4% 114 | 159.5% 109 | 119.0% 92 | 157.1% 108 | 54.8% 65 | 81.0% 76 |
| qft_4 | 46 | 176.1% 127 | 45.7% 67 | 26.1% 58 | 4.3% 48 | -30.4% 32 | -32.6% 31 |
| barenco_tof_4 | 48 | 112.5% 102 | 131.2% 111 | 29.2% 62 | 58.3% 76 | -31.2% 33 | -20.8% 38 |
| mod_mult_55 | 48 | 218.8% 153 | 331.2% 207 | 133.3% 112 | 64.6% 79 | 52.1% 73 | 60.4% 77 |
| vbe_adder_3 | 70 | 145.7% 172 | 67.1% 117 | 32.9% 93 | 25.7% 88 | -30.0% 49 | -27.1% 51 |
| barenco_tof_5 | 72 | 137.5% 171 | 287.5% 279 | 152.8% 182 | 152.8% 182 | 45.8% 105 | 41.7% 102 |
| rc_adder_6 | 93 | 190.3% 270 | 153.8% 236 | 80.6% 168 | 38.7% 129 | 37.6% 128 | 79.6% 167 |
| gf2^4_mult | 99 | 254.5% 351 | 367.7% 463 | 284.8% 381 | 119.2% 217 | 139.4% 237 | 72.7% 171 |
| mod_red_21 | 105 | 174.3% 288 | 236.2% 353 | 125.7% 237 | 108.6% 219 | 57.1% 165 | 59.0% 167 |
| hwb6 | 116 | 178.4% 323 | 143.1% 282 | 66.4% 193 | 47.4% 171 | 13.8% 132 | 4.3% 121 |
| grover_5 | 288 | 119.8% 633 | 228.1% 945 | 186.5% 825 | 110.4% 606 | 25.3% 361 | 7.3% 309 |
| hwb8 | 7129 | 206.2% 21829 | 180.1% 19970 | 203.6% 21642 | 160.7% 18585 | 65.5% 11798 | 60.3% 11430 |

Table 3: Compilation of a collection of standard circuits for the all-to-all architecture.

| circuit | init | linear | clifford | clifford⋆ | clifford† | clifford⋆† |
|---|---|---|---|---|---|---|
| tof_3 | 18 | 0.0% 18 | -33.3% 12 | -44.4% 10 | -61.1% 7 | -61.1% 7 |
| barenco_tof_3 | 24 | -8.3% 22 | -12.5% 21 | -41.7% 14 | -50.0% 12 | -54.2% 11 |
| mod5_4 | 28 | -10.7% 25 | -35.7% 18 | -46.4% 15 | -57.1% 12 | -57.1% 12 |
| tof_4 | 30 | -3.3% 29 | -13.3% 26 | -30.0% 21 | -43.3% 17 | -53.3% 14 |
| tof_5 | 42 | -2.4% 41 | -14.3% 36 | -33.3% 28 | -38.1% 26 | -33.3% 28 |
| qft_4 | 46 | -32.6% 31 | -28.3% 33 | -41.3% 27 | -52.2% 22 | -58.7% 19 |
| barenco_tof_4 | 48 | -4.2% 46 | -2.1% 47 | -22.9% 37 | -41.7% 28 | -47.9% 25 |
| mod_mult_55 | 48 | -6.2% 45 | -14.6% 41 | -25.0% 36 | -47.9% 25 | -39.6% 29 |
| vbe_adder_3 | 70 | -18.6% 57 | -21.4% 55 | -51.4% 34 | -70.0% 21 | -65.7% 24 |
| barenco_tof_5 | 72 | -8.3% 66 | -4.2% 69 | -27.8% 52 | -37.5% 45 | -48.6% 37 |
| rc_adder_6 | 93 | -4.3% 89 | -2.2% 91 | -15.1% 79 | -28.0% 67 | -14.0% 80 |
| gf2^4_mult | 99 | 23.2% 122 | 35.4% 134 | -20.2% 79 | -35.4% 64 | -39.4% 60 |
| mod_red_21 | 105 | 0.0% 105 | 37.1% 144 | -21.9% 82 | -7.6% 97 | -24.8% 79 |
| hwb6 | 116 | -2.6% 113 | 8.6% 126 | -16.4% 97 | -36.2% 74 | -35.3% 75 |
| grover_5 | 288 | -5.2% 273 | 10.1% 317 | 4.5% 301 | 8.7% 313 | 2.1% 294 |
| hwb8 | 7129 | -2.7% 6939 | 65.7% 11810 | 20.1% 8562 | 9.0% 7769 | 1.0% 7197 |

Standard circuits.

Unsurprisingly, the Clifford based approach almost systematically outperforms the other two approaches in the case of limited connectivity. Interestingly, for an all-to-all connectivity, the linear based approach seems to behave well and achieves a CNOT count reduction where the Clifford approach fails to (see grover_5 and hwb8). On this class of circuits, our pre-optimizations largely reduce the compilation overhead. Since these circuits are composed of Clifford and T gates, merging rotations is really beneficial. Most merges produce Clifford rotations that do not contribute to CNOT overhead in the compiled circuit.
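To make the merging step concrete, here is a minimal Python sketch of the MergeRotations routine (Algorithm 4). The encoding of a rotation as a (Pauli string, angle) pair and the names `commute` and `merge_rotations` are assumptions made for this illustration; this is not the authors' implementation, and it leaves out the removal of merged rotations that end up with a Clifford angle.

```python
from typing import List, Tuple

# Assumed encoding: a rotation is (Pauli string over "IXYZ", angle).
Rotation = Tuple[str, float]


def commute(p: str, q: str) -> bool:
    """Two Pauli strings commute iff they hold different non-identity
    letters on an even number of qubits."""
    anti = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return anti % 2 == 0


def merge_rotations(seq: List[Rotation]) -> List[Rotation]:
    """Greedy merge: scan the output list backwards, merge with a rotation
    of identical axis if every rotation in between commutes, else append."""
    merged: List[Rotation] = []
    for pauli, theta in seq:
        inserted = False
        for k in range(len(merged) - 1, -1, -1):
            other, angle = merged[k]
            if other == pauli:
                merged[k] = (other, angle + theta)  # same axis: add angles
                inserted = True
                break
            if not commute(pauli, other):
                break  # blocked: cannot move further left
        if not inserted:
            merged.append((pauli, theta))  # appended at the end of the list
    return merged
```

For instance, on the sequence ZI(0.25), IZ(0.5), ZI(0.25), XI(0.1), ZI(0.3), the second ZI rotation commutes past IZ and merges with the first one, while the last ZI rotation is blocked by XI and stays separate.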

MAX-k-LIN-2 and random Pauli sequences.

For both of these benchmarks the input circuits have a quite high Clifford density, since they are based on an initial naive implementation of a sequence of Pauli rotations. It is interesting to notice that since the MAX-k-LIN-2 circuits roughly correspond to phase polynomials, the linear and Clifford approaches have very comparable behaviors. The Clifford approach, however, can benefit from the rotation reordering optimization. With this optimization, the Clifford approach becomes overwhelmingly better than the SWAP or linear approach. Notice that the first point(s) of the graph (Hamming weight of 2) correspond exactly to MAX-CUT QAOA circuits, which are often taken as standard circuits for NISQ era applications. On these circuits, our method with rotation reordering presents a twofold improvement compared to the SWAP insertion approach. In the random Pauli setting, the Clifford approach, by itself, systematically beats the two other approaches. The rotation reordering optimization does not bring significant improvements compared to the standard Clifford approach.
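The rotation reordering optimization used here can be sketched in Python as follows. The first function mirrors the greedy GroupRotations routine (Algorithm 5); the second orders each group by a cost function. Everything named below is an assumption made for illustration: in particular, `support_size` is only a stand-in for the Steiner tree size used in the paper, which depends on the architecture graph, and a static sort simplifies the "least costly rotation first" rule.

```python
from typing import Callable, List, Tuple

# Assumed encoding: a rotation is (Pauli string over "IXYZ", angle).
Rotation = Tuple[str, float]


def commute(p: str, q: str) -> bool:
    """Pauli strings commute iff they clash on an even number of qubits."""
    anti = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return anti % 2 == 0


def group_rotations(seq: List[Rotation]) -> List[List[Rotation]]:
    """Greedily cut the sequence into consecutive groups of mutually
    commuting rotations, as in Algorithm 5."""
    groups: List[List[Rotation]] = []
    current: List[Rotation] = []
    for rot in seq:
        if all(commute(rot[0], other[0]) for other in current):
            current.append(rot)
            continue
        groups.append(current)
        current = [rot]
    groups.append(current)
    return groups


def support_size(rot: Rotation) -> int:
    """Hypothetical cost proxy: number of qubits the rotation acts on."""
    return sum(1 for letter in rot[0] if letter != "I")


def reorder(groups: List[List[Rotation]],
            cost: Callable[[Rotation], int] = support_size) -> List[List[Rotation]]:
    """Order each commuting group by increasing cost (stable sort)."""
    return [sorted(group, key=cost) for group in groups]
```

The order of the groups is preserved; only the order inside each group changes, which is valid because all rotations of a group commute.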

Beyond Clifford.

It is not clear how to extend this approach to groups larger than the Clifford group. It might be worth investigating the higher levels of the Clifford hierarchy and devising extraction routines for these operators, even though they do not exhibit the same properties as the Clifford group.

Gaussian operators.

A potential candidate is the group of Gaussian operators. This group corresponds to operators that can be implemented via circuits of matchgates. It has the nice feature of stabilizing the hierarchy of Hamiltonians that are bounded degree polynomials over a Clifford algebra [JM08]. It happens that these types of operators are the main ingredient used to construct UCCSD Ansätze for Fermionic dynamics (including VQE circuits for quantum chemistry or material science). For instance, in the quantum chemistry setting, this result entails that one can pull all single excitation terms of an Ansatz to the end of the Ansatz, and conjugate the final Hamiltonian with these terms. The resulting circuit will have a reduced number of terms to implement, but these terms might be harder to implement. Hence it is not obvious that one can gain anything by synthesizing these via a naive approach.

In the present work, we only developed algorithms that try to reduce the overall CNOT count of the output circuit. It is of course possible to change the metric to take into account more aspects of the final circuit. A good start would be to use finer hardware models and (roughly) compute the fidelity of each produced sub-circuit, picking the most faithful one. This simple approach has been proven to improve the overall circuit fidelity compared to straightforward gate count minimization in the SWAP insertion setting. Similarly, one can also aim at reducing entangling depth instead of entangling gate count.
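As a rough illustration of such a fidelity-based metric, the sketch below scores a candidate sub-circuit by the product of the success probabilities of its CNOTs and picks the best candidate. The error model (independent per-edge CNOT error rates), the data layout, and the names `estimated_fidelity` and `pick_most_faithful` are all assumptions made for this example; they are not taken from the paper.

```python
from math import prod
from typing import Dict, FrozenSet, List, Tuple

# Assumed encoding: a candidate sub-circuit is the list of its CNOTs,
# each given as a (control, target) pair of physical qubits.
Cnot = Tuple[int, int]
EdgeErrors = Dict[FrozenSet[int], float]


def estimated_fidelity(cnots: List[Cnot], edge_error: EdgeErrors) -> float:
    """Crude estimate: product of (1 - error) over all CNOTs, assuming
    independent errors and ignoring single-qubit gates."""
    return prod(1.0 - edge_error[frozenset(gate)] for gate in cnots)


def pick_most_faithful(candidates: List[List[Cnot]],
                       edge_error: EdgeErrors) -> List[Cnot]:
    """Return the candidate with the highest estimated fidelity."""
    return max(candidates, key=lambda c: estimated_fidelity(c, edge_error))


# Hypothetical calibration data: edge (1, 2) is noisier than edge (0, 1).
errors = {frozenset((0, 1)): 0.01, frozenset((1, 2)): 0.05}
three_good_cnots = [(0, 1), (1, 0), (0, 1)]
two_noisy_cnots = [(1, 2), (2, 1)]
best = pick_most_faithful([two_noisy_cnots, three_good_cnots], errors)
```

With these made-up numbers the three-CNOT candidate is preferred over the two-CNOT one, which shows how this metric can disagree with plain CNOT counting.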

In this work, we take a local (with finite depth search) approach to tackle the problem of (re-)synthesis of a sequence of Pauli rotations. It could be interesting to apply global techniques for the synthesis of groups of commuting rotations to solve this problem. This would probably lead to better results for standard Clifford + T circuits. It remains unclear if these approaches can behave well for NISQ era circuits. A recent work by Gheorghiu et al. [GLMM20] tackles the problem by extracting and re-synthesizing phase polynomials out of the input circuit. This approach seems to perform quite well on some circuits and far worse than our method on others. For instance, Table 4 sums up the performances of their two splitting heuristics and our SWAP and Clifford based compilers on a 3 × ×

| circuit | init | swap | clifford⋆† | CNOT-OPT-A | CNOT-OPT-B |
|---|---|---|---|---|---|
| tof_5 | 42 | 128.6% | 11.9% | 140.82% | 138.78% |
| mod_mult_55 | 48 | 168.8% | 14.6% | 321.82% | 203.64% |
| barenco_tof_5 | 72 | 125.0% | -36.1% | 245.24% | 140.48% |
| grover_5 | 288 | 108.3% | 3.8% | 116.67% | 105.36% |
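The relative overheads reported in the benchmark tables are consistent with the ratio (final − init) / init between the compiled and initial CNOT counts. The short Python check below reproduces a few entries of Table 1; the function name `relative_overhead` is only an illustrative choice.

```python
def relative_overhead(init_cnots: int, final_cnots: int) -> float:
    """Relative CNOT overhead in percent; negative values mean the
    compiled circuit has fewer CNOTs than the input circuit."""
    return 100.0 * (final_cnots - init_cnots) / init_cnots


# Entries taken from Table 1 (Melbourne): (init, compiled count).
tof_3_swap = round(relative_overhead(18, 39), 1)            # swap column
barenco_tof_3_best = round(relative_overhead(24, 14), 1)    # clifford⋆† column
hwb8_swap = round(relative_overhead(7129, 23311), 1)        # swap column
```

These evaluate to 116.7, -41.7 and 227.0 respectively, matching the percentages printed next to the counts in the table.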

We presented a meta-heuristic called lazy synthesis that exploits efficient representations of elements in a subgroup of the unitary group in order to compile an input quantum circuit into an architecture compliant circuit. We showed how this meta-heuristic can be used to reformulate a standard SWAP insertion algorithm from the literature and produced two new compilation algorithms based on the partial synthesis of linear Boolean operators and Clifford operators. Finally, we ran benchmarks on various classes of circuits, providing evidence that these algorithms are competitive in a NISQ setting.

While our algorithms seem to be well behaved on NISQ oriented quantum circuits, it remains unclear whether they scale to very large Clifford + T quantum circuits. It is very likely that their inherently local structure will hinder performance on large circuits.

Acknowledgments

This work was supported in part by the French National Research Agency (ANR) under the research project SoftQPRO ANR-17-CE25-0009-02, and by the DGE of the French Ministry of Industry under the research project PIAGDN/QuantEx P163746-484124.

References

[AAM18] Matthew Amy, Parsiad Azimzadeh, and Michele Mosca. On the controlled-not complexity of controlled-not–phase circuits. Quantum Science and Technology, 4(1):015002, 2018.

[AG04] Scott Aaronson and Daniel Gottesman. Improved simulation of stabilizer circuits. Physical Review A, 70(5), Nov 2004.

[AG20] Matthew Amy and Vlad Gheorghiu. staq—a full-stack quantum processing toolkit. Quantum Science and Technology, 5(3):034016, Jun 2020.

[BM20] Sergey Bravyi and Dmitri Maslov. Hadamard-free circuits expose the structure of the clifford group. arXiv preprint arXiv:2003.09412, 2020.

[CSU19] Andrew M. Childs, E. Schoute, and Cem M. Unsal. Circuit transformations for quantum architectures. ArXiv, abs/1902.09102, 2019.

[dB11] Niel de Beaudrap. A linearized stabilizer formalism for systems of finite dimension, 2011.

[dBBV+20] Timothée Goubault de Brugière, Marc Baboulin, Benoît Valiron, Simon Martiel, and Cyril Allouche. Quantum cnot circuits synthesis for nisq architectures using the syndrome decoding problem. In International Conference on Reversible Computation, pages 189–205. Springer, 2020.

[GLMM20] Vlad Gheorghiu, Sarah Meng Li, Michele Mosca, and Priyanka Mukhopadhyay. Reducing the cnot count for clifford+t circuits on nisq architectures, 2020.

[HNYN11] Yuichi Hirata, Masaki Nakanishi, Shigeru Yamashita, and Yasuhiko Nakashima. An efficient conversion of quantum circuits to a linear nearest neighbor architecture. Quantum Information & Computation, 11(1&2):142–166, 2011.

[JM08] Richard Jozsa and Akimasa Miyake. Matchgates and classical simulation of quantum circuits. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 464(2100):3089–3106, Jul 2008.

[KMS07] Samuel A Kutin, David Petrie Moulton, and Lawren M Smithline. Computation at a distance. arXiv preprint quant-ph/0701194, 2007.

[KvdG19] Aleks Kissinger and Arianne Meijer van de Griend. Cnot circuit extraction for topologically-constrained quantum memories, 2019.

[LDX18] Gushu Li, Yufei Ding, and Yuan Xie. Tackling the qubit mapping problem for nisq-era quantum devices, 2018.

[Lit19] Daniel Litinski. Magic state distillation: Not as costly as you think. Quantum, 3:205, Dec 2019.

[NGM20] Beatrice Nash, Vlad Gheorghiu, and Michele Mosca. Quantum circuit optimizations for nisq architectures. Quantum Science and Technology, 5(2):025010, Mar 2020.

[PMH08] Ketan N Patel, Igor L Markov, and John P Hayes. Optimal synthesis of linear reversible circuits. Quantum Information & Computation, 8(3):282–294, 2008.

[Pre18] John Preskill. Quantum computing in the nisq era and beyond. Quantum, 2:79, Aug 2018.

[SSP13] A. Shafaei, M. Saeedi, and M. Pedram. Optimization of quantum circuits for interaction distance in linear nearest neighbor architectures. In , pages 1–6, 2013.

[vdBT20] Ewout van den Berg and Kristan Temme. Circuit optimization of hamiltonian simulation by simultaneous diagonalization of pauli clusters. Quantum, 4:322, Sep 2020.

[vdGD20] Arianne Meijer van de Griend and Ross Duncan. Architecture-aware synthesis of phase polynomials for nisq devices, 2020.

[ZPW17] Alwin Zulehner, Alexandru Paler, and Robert Wille. An efficient methodology for mapping quantum circuits to the ibm qx architectures, 2017.

A Dealing with the final operator

In this section we detail how to classically emulate any final non-trivial Clifford operator. This encompasses the case of a final permutation or linear operator, even though the case of a final permutation can be trivially dealt with.

A.1 Expected value of some observable

In this setting, we assume that we are given as input both the circuit $C_{in}$ to compile and some final observable $H$ to evaluate at the end of the circuit execution. In short, we need to compute:

$$\langle 0 | C_{in}^\dagger H C_{in} | 0 \rangle$$

Using either the linear operator synthesis approach or the Clifford approach, we end up producing a circuit $C_{out}$ and a final linear/Clifford operator $A$ such that:

$$\langle 0 | C_{in}^\dagger H C_{in} | 0 \rangle = \langle 0 | C_{out}^\dagger A^\dagger H A C_{out} | 0 \rangle$$

Let us further assume that $H$ is given to us in the Pauli basis. That is:

$$H = \sum_i \alpha_i P_i$$

with $\alpha_i$ some real coefficients, and $P_i \in \mathcal{P}_n$ some Pauli operators. Then, sampling the new observable $A^\dagger H A = \sum_i \alpha_i A^\dagger P_i A = \sum_i \alpha_i P'_i$ on the output circuit is equivalent to sampling the input observable on the input circuit.

Sampling this observable using the standard techniques of co-diagonalization of its terms is no more costly (in terms of shots) than sampling the original $H$.

A.2 Sampling bit-strings

If we are required to provide some samples taken according to the final distribution induced by $C_{in}|0\rangle$, things are a bit trickier.

Our algorithms output a pair $C_{out}, A$ such that $A C_{out} = C_{in}$, and we would like to emulate sampling of $A C_{out}|0\rangle = C_{in}|0\rangle$. To do so we proceed as follows.

Define $\mathcal{Z} = \{ Z_i, i \in [n] \}$, the set of local $Z$ operators on each qubit. Sampling bit-strings out of some quantum state over $n$ qubits boils down to iteratively evaluating the value of these operators in any order (since they commute). We would like to evaluate this collection of operators on state $A C_{out}|0\rangle$. This is equivalent to evaluating the collection of Pauli operators $A^\dagger \mathcal{Z} A = \{ P_i = A^\dagger Z_i A, i \in [n] \}$ on state $C_{out}|0\rangle$. These operators, however, might not be diagonal operators, and thus cannot be directly evaluated using standard computational basis measurements. Nevertheless, these operators commute with one another since the $Z_i$ commute. Hence, one can co-diagonalize them using a Clifford circuit $C_{diag}$. By construction, the new collection of Pauli operators $\{ s_i Q_i = C_{diag} P_i C_{diag}^\dagger, i \in [n] \}$ are diagonal operators (hence products of $Z_j$ and $I_j$) times some phase $s_i = \pm 1$. Sampling these operators on state $C_{diag} C_{out}|0\rangle$ is, by construction, equivalent to evaluating operators in $\mathcal{Z}$ over state $A C_{out}|0\rangle$. Figure 9 depicts this sequence of conjugations.

Since the $Q_i$ are products of $Z_j$ and $I_j$ operators, they can be seen as computing parities over a subset of qubits. This gives us a simple algorithm to fix measurement results. We can sample some bit-string $w$ out of the quantum state $C_{diag} C_{out}|0\rangle$ and output a new bit-string $w'$ with

$$w'_i = \delta_{-1, s_i} \oplus \sum_{j \in Q_i} w_j$$

where the sum is modulo 2. This operation boils down to applying an affine system over $\mathbb{F}_2^n$ described by the $(s_i, Q_i)$ operators.

For example, let us assume that we need to sample bit-strings over 2 qubits. Let us assume that after conjugation through $A$ and co-diagonalization, we get operators $Q_1 = Z \otimes Z$, $s_1 = -1$ and $Q_2 = Z \otimes I$, $s_2 = 1$. These operators can be summed up via the following affine system over $\mathbb{F}_2$:

$$x \mapsto Lx + b \quad \text{with} \quad L = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad b = (1, 0)^T.$$

Any bit-string sampled from state $C_{diag} C_{out}|0\rangle$ can be fixed by applying $L$ and adding $b$:

$$00 \mapsto 10, \quad 01 \mapsto 00, \quad 10 \mapsto 01, \quad 11 \mapsto 11$$

Figure 9: The sampling fixing procedure. (a) We need to emulate sampling of the quantum state $A C_{out}|0\rangle = C_{in}|0\rangle$. This sampling procedure relies on the joint measurements of operators $Z_i$ for each qubit $i$. (b) Since $A$ is Clifford, we can commute the $Z_i$ with $A$, yielding a collection of commuting operators $P_i$ (not necessarily diagonal). (c) These operators can be jointly measured by co-diagonalizing them via a Clifford circuit $C_{diag}$, yielding a collection of diagonal operators $s_i Q_i$ where the $Q_i$ are products of $Z$ operators and $s_i = \pm 1$. Once the $s_i Q_i$ are measured, a simple linear system inversion allows us to emulate the sampling of the initial $Z_i$ operators. Hence, in practice, only $C_{out}$ and $C_{diag}$ are effectively performed on the quantum processor.

Remark on co-diagonalization.

To the best of our knowledge, [vdBT20] provides the best approach to produce a co-diagonalization circuit. They show that this task can be reduced to the synthesis of a linear Boolean operator, which has a worst case complexity of $O(n^2 / \log n)$ in the case of a non-constrained architecture. A simple application of any architecture aware CNOT synthesis heuristic, like [dBBV+20] [...] $C_{diag}$.

This argument is essential to claim that, in most cases, it is more efficient to synthesize $C_{diag}$ rather than directly synthesizing $A$. Indeed, the synthesis of an arbitrary Clifford operator is usually done by the synthesis of successive layers of Hadamard gates, Phase gates, CNOT gates or CZ gates. Most recent results show that three layers of two-qubit gates are necessary [BM20], making it more affordable to use the co-diagonalization process.

B Recursive search of finite depth

In the three algorithms described in this paper, the extraction functions perform a recursive search over the next $w$ calls to the extraction in order to locally pick the subcircuit that will generate the least extraction overhead. This recursive search is introduced in [HNYN11] for SWAP insertion and can be easily transposed to the linear Boolean operator and Clifford setting.

In practice, this is done by computing a tree of depth $w$ containing all the possible choices and associating to each leaf of this tree the sum of all sub-circuit scores on the path from the root to the leaf. In the algorithms presented in this paper, we simply used the CNOT count as a metric, but any other metric, such as overall fidelity or depth, can be used in this search. Figure 10 depicts two search trees of depth 0 and 1.

Figure 10: Search trees of depth 0 and 1. In the first tree, we stop the recursive search at depth 0. In this situation all possible choices a, b, and c are equivalent since they produce sub-circuits of score 6. Hence, we greedily pick the first choice a. After exploring at depth 1, we notice that choosing option bb