Architecture aware compilation of quantum circuits via lazy synthesis
Simon Martiel, Atos Quantum Lab, Les Clayes-sous-Bois, France
Timothée Goubault de Brugière, Laboratoire de Recherche en Informatique, Orsay, France
December 18, 2020
Abstract
Qubit routing is a key problem in quantum circuit compilation. It consists in rewriting a quantum circuit by adding the least possible number of instructions to make the circuit compliant with some architecture's connectivity constraints. Usually, this problem is tackled via either SWAP insertion techniques or re-synthesis of portions of the circuit using architecture-aware synthesis algorithms. In this work, we propose a meta-heuristic that couples the iterative approach of SWAP insertion techniques with greedy architecture-aware synthesis routines. We propose two concrete compilation algorithms based on this meta-heuristic and compare their performance to SWAP insertion techniques for several standard classes of quantum circuits. We also compare our approach with existing universal compilation techniques and show a significant reduction in the entangling gate overhead due to compilation.
Compilation is a key step in any software stack. Programs are often specified using a high-level programming language that allows the programmer to describe the manipulation of the processor's memory using abstract structures. This high-level description is then refined, sometimes in several stages, until it can be fully expressed as a sequence of low-level instructions that can be executed by the processor. Quantum programming is no exception. In order to leverage the power of a quantum processor, one needs to compile high-level quantum programs into lower-level sequences of quantum instructions. This compilation step is particularly critical in the case of so-called NISQ processors [Pre18]. In these settings, the quantum instructions are prone to errors and the quantum memory undergoes decoherence, leading to quite large error rates. Consequently, there is a strong need for efficient heuristics that reduce the instruction count while still satisfying the architecture's constraints.

One of the most challenging problems in the field of compilation of quantum programs is the qubit routing problem. Most quantum processors come with a limited chip connectivity, only allowing a (usually) small number of couplings between the different qubits. The input circuit should therefore be altered in order to only make use of the available interactions. This problem is traditionally tackled via the insertion of additional SWAP gates inside the circuit in order to move logical qubits from one physical qubit to another [HNYN11, SSP13, LDX18, ZPW17, CSU19]. These techniques are inherently inefficient in the sense that they can only add gates to the compiled circuits and usually ignore the nature of the computation. Most of these algorithms lead to quite large SWAP/CNOT overheads when compared to the original circuit size.
These overheads can be detrimental to the success rate of the algorithm.

More recently, people started investigating the transverse approach of synthesizing quantum circuits that are readily compliant with a given connectivity. This approach is usually restricted to a particular subset of quantum circuits such as Boolean linear operators [KvdG19, dBBV+ …

• Pick a subgroup of unitary operators that are easy to represent classically. By easy we mean that their classical representation has a polynomial size in the number of qubits and can be efficiently updated for composition.
• Initialize a data structure representing the identity.
• Iterate over the input circuit:
  – if the incoming gate belongs to the subgroup, update the current data structure with this gate,
  – if not, figure out a way to synthesize a piece of the current data structure into a circuit such that one can safely insert the incoming gate in the output circuit.

This paper is organized as follows. We start by formalizing the meta-algorithm succinctly described above using what we call the lazy synthesis framework. Section 3 shows how a standard SWAP insertion algorithm from [HNYN11] fits into this framework. We then extend this algorithm by using the group of linear Boolean reversible operators in Section 4 and the Clifford group in Section 5. Some benchmarks against standard classes of circuits are provided and discussed in Section 6, together with a comparison with recent works in general purpose compilation. Finally, we propose some possible extensions, and conclude in a last section.

In this section we present a general formulation of the lazy synthesis meta-heuristic.
Notations:
• Circuits are words on a (potentially infinite) gate set. We use :: for concatenation, and ε for the empty circuit.
• Given some gate $g$, we denote by $\tilde{g}$ its corresponding $n$-qubit unitary operator, and extend this notation to circuits. For instance, given a circuit $c = g_1 :: g_2$ as a word, the corresponding equation in $U(2^n)$ is $\tilde{c} = \tilde{g}_2 \cdot \tilde{g}_1$, where $\cdot$ stands for the standard linear operator composition.

To introduce our framework, we first need some simple conventions. We will assume that the input circuit is a sequence of gates taken from a set $G_{in}$, and that the output circuit should have gates in another gate set $G_{out}$. Here, we deliberately use a quite broad notion of gate set. For instance, $G_{out}$ could contain the exact same gates as $G_{in}$ but with additional constraints, such as connectivity constraints. We will also assume that we have access to some data structure $D = \langle \mathcal{H}, \llbracket . \rrbracket, S, u, e \rangle$ representing a class of unitary operators, with the following constraints:
• $\mathcal{H}$ is some set of classical descriptions. We will usually require these descriptions to be small (i.e. polynomial in the number of qubits and/or the number of input gates).
• $\llbracket . \rrbracket : \mathcal{H} \to U(2^n)$ is an interpretation of the descriptions in $\mathcal{H}$ as unitary operators.
• $S \subseteq G_{in}$ is a subset of the input gate set. Our data structure $D$ corresponds to the class of operators that can be implemented by circuits with gates from $S$.
• $u : \mathcal{H} \times S \to \mathcal{H}$ is an update function that verifies
$$\llbracket u(h, g) \rrbracket = \tilde{g} \cdot \llbracket h \rrbracket,$$
that is, $u$ is sound with respect to $\llbracket . \rrbracket$. Less formally, $u$ updates $h$ into $u(h, g)$ by absorbing $g$ into $h$. We will usually require $u$ to update $h$ efficiently (i.e. to run in polynomial time w.r.t. the size of $h$).
• $e : \mathcal{H} \times \bar{S} \to \mathcal{H} \times G_{out}^{*}$, where $\bar{S}$ is the complement of $S$ in $G_{in}$. The function $e$ is an extraction function that verifies
$$h', c = e(h, g) \implies \tilde{g} \cdot \llbracket h \rrbracket = \llbracket h' \rrbracket \cdot \tilde{c}.$$
Less formally, $e$ tells us how to commute $g$ with $h$ at the cost of updating $h$ into $h'$ and turning $g$ into a sub-circuit $c$. We will usually require $e$ to be efficient.

Equipped with such a data structure, we can describe our meta-heuristic as the simple recipe detailed in Algorithm 1.

Algorithm 1
Lazy synthesis meta-heuristic
procedure LazySynth(c_in)
    h ← Id
    c_out ← ε
    for g in c_in do
        if g ∈ S then
            h ← u(h, g)
        else
            h′, c ← e(h, g)
            h ← h′
            c_out ← c_out :: c
        end if
    end for
    return h, c_out
end procedure

The main idea of the heuristic is to iteratively aggregate gates of $c_{in}$ in $h$ and $c_{out}$ while maintaining the invariant
$$\tilde{c}_{in}[1..i] = \llbracket h \rrbracket \cdot \tilde{c}_{out}.$$
That is: after compiling gate $i$, the initial segment $c_{in}[1..i]$ is equivalent to the composition of the current output circuit $c_{out}$ followed by the current stored operator $h$. It is easy to check the soundness of the algorithm using the expected properties of $u$ and $e$. The process of our meta-heuristic is illustrated in Figure 1.

In other words, the gates in $S$ are the ones we want to avoid executing on the quantum processor. As they belong to a group of efficiently simulable operators, our goal is to keep track classically of their action on the memory for as long as possible using our update function $u$. When a gate $g$ not belonging to $S$ arises in the circuit, we try to minimize the number of extra gates needed to execute $g$ while keeping the functionality of the global operator. This is the goal of the extraction function $e$.

As one can notice, all the complexity of the heuristic lies in the implementation of the update and extraction functions $u$ and $e$. These functions will heavily rely on the underlying data structures.

[Figure 1 panels. Current circuit: the $G$-compatible output circuit $c_{out}$, the not-yet-executed operator $\llbracket T \rrbracket$, and the next gate $g$ of $c_{in}$. Updated circuit if $g \in S$: $c_{out}$, $\llbracket u(T, g) \rrbracket$, $c_{in}$. Updated circuit if $g \notin S$, with $T', c_{corr} = e(T, g)$: the new $c_{out}$ (the previous $c_{out}$ followed by the short $G$-compatible circuit $c_{corr}$), $\llbracket T' \rrbracket$, $c_{in}$.]

Figure 1: Illustration of Algorithm 1.
At any stage of the algorithm, the product $\tilde{c}_{in} \cdot \llbracket T \rrbracket \cdot \tilde{c}_{out}$, where $c_{in}$ denotes the part of the input circuit that remains to be processed, is equal to the operator implemented by the input quantum circuit.

In the next section, we show how to embed a SWAP insertion technique described in [HNYN11] into this framework. Later we will extend it to a broader set of operators to improve its performance.

In [HNYN11], the authors propose a heuristic to iteratively rewrite a quantum circuit by inserting SWAP gates to route logical qubits. In this approach, we will rely on the fact that elements in the group $S_n$ can be efficiently represented and manipulated. In order to represent an element $\sigma \in S_n$, we can simply store an array of integers $[\sigma(1), ..., \sigma(n)]$. Moreover, given the representations of two permutations $\sigma$ and $\pi$, the representation of $\sigma \circ \pi$ is simple to compute.

Data structures.
We now describe how this algorithm is a particular case of our framework. We first need to define $G_{in}$, $G_{out}$, and $S \subseteq G_{in}$:
• $G_{in}$ contains any gate acting on at most 2 qubits,
• $G_{out}$ contains any gate acting on at most 2 qubits and such that the gate is compatible with some connectivity graph $G$,
• finally $S = \{ SWAP_{i,j},\ i, j \in [n] \}$ is the set of all possible qubit SWAPs.

The classical data structure simply describes a qubit permutation: $D = \langle S_n, \llbracket . \rrbracket, S, u, e \rangle$, where:
• $S_n$ denotes the permutation group over $n$ elements, where $n = |V(G)|$ is the number of qubits,
• $\llbracket . \rrbracket$ trivially associates to a permutation the corresponding $n$-qubit unitary operator,
• $u$ composes the current permutation with an incoming swap: $u(\sigma, SWAP_{i,j}) = (i, j) \circ \sigma$.

We now describe our extraction routine. Let $g$ be some gate in the input circuit. If $g$ is such that $\sigma^{-1}(g)$ is compatible with $G$, we can simply use the fact that
$$g \cdot \llbracket \sigma \rrbracket = \llbracket \sigma \rrbracket \cdot \sigma^{-1}(g)$$
to set $e(\sigma, g) = \sigma, \sigma^{-1}(g)$. However, if $\sigma^{-1}(g)$ is not compatible with $G$, we need to produce a piece of $G$-compatible SWAP circuit $c_\pi$ implementing a permutation $\pi$ such that $\sigma'^{-1}(g)$ is compatible with $G$, with $\sigma' = \sigma \circ \pi^{-1}$. Then, we have that:
$$g \cdot \llbracket \sigma \rrbracket = \llbracket \sigma \circ \pi^{-1} \rrbracket \cdot \sigma'^{-1}(g) \cdot \llbracket \pi \rrbracket = \llbracket \sigma \circ \pi^{-1} \rrbracket \cdot \sigma'^{-1}(g) \cdot \tilde{c}_\pi$$
If we can produce such a circuit $c_\pi$, we can set $e(\sigma, g) = \sigma \circ \pi^{-1},\ c_\pi :: \sigma'^{-1}(g)$. We now describe how such a SWAP circuit is produced in the algorithm of Hirata et al.

Considering the fact that we need gate $\sigma'^{-1}(g) = (\pi \circ \sigma^{-1})(g)$ to be compatible with $G$, $\pi$ can be seen as a permutation bringing the qubits of $\sigma^{-1}(g)$ close to one another in $G$. Let $a, b$ be the pair of qubits on which $g$ acts and let $p = (\sigma^{-1}(a) = p_1, ..., p_k = \sigma^{-1}(b))$ be the shortest path from $\sigma^{-1}(a)$ to $\sigma^{-1}(b)$ in $G$.
The algorithm enumerates $k - 1$ permutations, obtained by moving $\sigma^{-1}(a)$ toward $\sigma^{-1}(b)$ along $p$ and vice versa until they meet somewhere along an edge of $p$. For each of these permutations, the algorithm is called recursively for the next $w$ entangling gates, and the permutation leading to the lowest SWAP overhead is picked and committed to the output circuit, thus producing $c_\pi$. Figure 2 gives an example of such a permutation enumeration. The general structure of such a recursive search is described in Appendix B. As expected, the performance of this algorithm heavily depends on the recursion depth parameter $w$.

The overall worst-case complexity of this algorithm is $O(m n^{w})$, with $m$ the number of entangling gates and $n$ the number of qubits, and neglecting the pre-computation of shortest paths.

[Figure 2: example of permutation enumeration along a shortest path (dashed edges in panel (a)). Three different permutations are then explored, and for each of them the search recurses over the next $w$ entangling gates for some fixed parameter $w$.]

Now, using the lazy synthesis framework to describe a SWAP insertion algorithm may seem a bit tedious and unnecessary. In this section, we show how, by extending our classical data structure, we can generalize the approach of Hirata et al. to outperform it in some settings.
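To make the embedding of SWAP insertion into Algorithm 1 concrete, here is a minimal Python sketch under stated assumptions. It is not the algorithm of Hirata et al.: instead of enumerating candidate permutations with a recursive search of depth w, it greedily moves the first qubit of each gate along a shortest path. The names `lazy_swap_synth`, `shortest_path`, `simulate` and `equivalent` are hypothetical, qubits are assumed to be labeled 0..n-1, the graph is assumed connected, and the permutation is stored as an array `pos` whose convention (stated in the docstring) is chosen for this sketch only. Restricting the non-SWAP gates to CX lets the invariant be checked by classical simulation.

```python
from collections import deque
from itertools import product

def shortest_path(adj, src, dst):
    """Breadth-first shortest path in a connected undirected graph {v: neighbors}."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            break
        for w in adj[v]:
            if w not in prev:
                prev[w] = v
                queue.append(w)
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def lazy_swap_synth(circuit, adj):
    """Lazy synthesis where the stored operator h is a wire permutation.

    circuit: list of (name, a, b) two-qubit gates on input wires.
    pos[w] is the output wire holding the content that the input circuit
    keeps on wire w, i.e. [[h]] moves the content of wire pos[w] to wire w.
    """
    pos = list(range(len(adj)))          # h <- Id
    out = []                             # c_out <- empty circuit
    for name, a, b in circuit:
        if name == "SWAP":               # g in S: absorb it classically
            pos[a], pos[b] = pos[b], pos[a]
            continue
        # g not in S: bring pos[a] next to pos[b] with physical SWAPs
        path = shortest_path(adj, pos[a], pos[b])
        for p, q in zip(path[:-2], path[1:-1]):
            out.append(("SWAP", p, q))
            pos = [q if x == p else p if x == q else x for x in pos]
        out.append((name, pos[a], pos[b]))
    return pos, out

def simulate(circuit, bits):
    """Classical simulation of SWAP / CX (control a, target b) on a bit list."""
    bits = list(bits)
    for name, a, b in circuit:
        if name == "SWAP":
            bits[a], bits[b] = bits[b], bits[a]
        else:
            bits[b] ^= bits[a]
    return bits

def equivalent(circuit, pos, out, n):
    """Check c_in = [[h]] . c_out on every computational basis state."""
    for x in product([0, 1], repeat=n):
        ref, got = simulate(circuit, x), simulate(out, x)
        if any(ref[w] != got[pos[w]] for w in range(n)):
            return False
    return True

# Line architecture 0 - 1 - 2 - 3
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
c_in = [("CX", 0, 3), ("SWAP", 1, 2), ("CX", 1, 0)]
pos, c_out = lazy_swap_synth(c_in, adj)
```

In this example the input SWAP is never emitted: it only relabels `pos`, and the final permutation is returned alongside the output circuit, as in Algorithm 1.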
We consider the set of reversible circuits over $n$ qubits comprising only CNOT gates. This set generates the entire set of reversible linear Boolean operators over $n$ variables, and in particular contains the set of all permutations of $n$ elements. This set has a lot of nice properties: it is easy to represent its elements via $n \times n$ Boolean tables, each row representing an output parity of the circuit [AAM18]. More precisely, given a linear reversible operator $A \in \mathbb{F}_2^{n \times n}$ acting on $n$ qubits at initial values $x = (x_0, x_1, ..., x_{n-1})$, $x_i \in \{0, 1\}$, the logical value of the $i$-th qubit after execution of $A$ is given by
$$\alpha_0 x_0 \oplus \alpha_1 x_1 \oplus ... \oplus \alpha_{n-1} x_{n-1}$$
where $\alpha = A[i, :]$ is the $i$-th row of $A$ and $\oplus$ stands for the XOR operation. Therefore we can keep track of the action of CNOT circuits on the quantum memory with a polynomially-sized structure.

Moreover, it is simple to update such tables via row (resp. column) operations to accommodate left (resp. right) composition of the operator by a CNOT [PMH08]. More generally, given an initial table $A$ and a linear reversible circuit implementing a table $B$, the updated table is given by $BA$.

Lazy linear synthesis. Our gate sets are defined as follows:
• $G_{in}$ contains any 1-qubit gate and CNOT gates on arbitrary pairs of qubits, thus also including SWAPs,
• $G_{out}$ contains any 1-qubit gate and CNOT gates compatible with some connectivity graph $G$,
• finally $S = \{ CNOT_{i,j} \mid i, j \in V(G) \}$ is the set of CNOT gates.

The classical data structure describes reversible linear Boolean operators over $n = |V(G)|$ qubits:
• $\mathcal{H}$ is the set of invertible $n \times n$ Boolean matrices,
• $\llbracket . \rrbracket$ trivially associates to a linear operator the corresponding $n$-qubit unitary operator,
• $u$ updates a table as expected with a matrix/matrix product: $u(A, CNOT_{i,j}) = E_{i,j} \cdot A$, where $E_{j,i}$ is the table representation of the operator $CNOT_{j,i}$ given by the identity matrix with one additional 1 at row $i$, column $j$.
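As an illustration of the update function u on parity tables, here is a small Python sketch. The helper names (`identity`, `elementary`, `matmul_f2`, `update`, `parities`) are hypothetical, tables are plain lists of 0/1 rows, and the first index of a CNOT is read as its control, which is how the definition of the elementary table above is interpreted here. The sketch checks on an example that the product with the elementary table coincides with a single row operation.

```python
def identity(n):
    """n x n identity table over F_2."""
    return [[int(i == j) for j in range(n)] for i in range(n)]

def elementary(n, control, target):
    """Table of CNOT(control -> target): identity plus a 1 at (target, control)."""
    E = identity(n)
    E[target][control] = 1
    return E

def matmul_f2(B, A):
    """Matrix product B.A over F_2."""
    n = len(A)
    return [[sum(B[i][k] & A[k][j] for k in range(n)) % 2 for j in range(n)]
            for i in range(n)]

def update(A, control, target):
    """u(A, CNOT): left composition, done as one row operation on a copy of A."""
    A = [row[:] for row in A]
    A[target] = [a ^ b for a, b in zip(A[target], A[control])]
    return A

def parities(A, x):
    """Logical value of each qubit: XOR of the inputs selected by each row."""
    return [sum(a & b for a, b in zip(row, x)) % 2 for row in A]

# Absorb CNOT(0 -> 1) then CNOT(1 -> 2) into the identity table.
A = identity(3)
for c, t in [(0, 1), (1, 2)]:
    A = update(A, c, t)
```

The row operation costs O(n) per absorbed CNOT, whereas the explicit product with the elementary table costs O(n^3) in this naive form; both yield the same table.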
In practice, given the simple structure of the $E_{i,j}$ operators, we recover the property that the action of a left-composition by a CNOT operator on $A$ is equivalent to a row operation on the table $A$.

Given some incoming 1-qubit gate $g$ acting on qubit $q$ and some linear operator $A$, the behavior of our extraction routine relies on the following two properties:
1. If $A$ has shape
$$A = \begin{pmatrix} B' & 0 & B'' \\ 0 & 1 & 0 \\ B''' & 0 & B'''' \end{pmatrix} \qquad (1)$$
where the middle row and the middle column are the $q$-th row and the $q$-th column of $A$, then $A$ acts as the identity on qubit $q$. Consequently, any 1-qubit gate acting on qubit $q$ commutes with $A$.
2. For any invertible $B \in \mathbb{F}_2^{n \times n}$, we have the relation
$$\llbracket A \rrbracket = \llbracket A B B^{-1} \rrbracket = \llbracket AB \rrbracket \cdot \llbracket B^{-1} \rrbracket.$$
This means that if we add a linear reversible circuit implementing $B^{-1}$ to our current circuit, then, to preserve the functionality of our quantum circuit, the classical representation of the qubits is updated to $AB$.

One can always find an operator $B$ such that $AB$ has the shape given by Eq. (1). Given such an operator $B$, we have
$$\tilde{g} \cdot \llbracket A \rrbracket = \tilde{g} \cdot \llbracket AB \rrbracket \cdot \llbracket B^{-1} \rrbracket = \llbracket AB \rrbracket \cdot \tilde{g} \cdot \llbracket B^{-1} \rrbracket,$$
where the first equality uses property 2 and the second uses property 1. Hence, we define our extraction function $e$ as
$$e(A, g) = (AB,\ c :: g)$$
where $B$ is such that $AB$ satisfies Eq. (1) and $c$ is a $G$-compatible circuit implementing $B^{-1}$.

In fact, we can slightly relax the structure of $AB$ and apply $g$ on a qubit different from qubit $q$. Indeed, considering another qubit $q' \neq q$ and writing $S_{q,q'}$ for the Boolean linear operator associated with the swap of qubits $q$ and $q'$, we have
$$\tilde{g} \cdot \llbracket A \rrbracket = \llbracket AB \rrbracket \cdot \tilde{g} \cdot \llbracket B^{-1} \rrbracket = \llbracket AB \rrbracket \cdot \tilde{g} \cdot \llbracket S_{q,q'} \rrbracket \cdot \llbracket S_{q,q'} \rrbracket \cdot \llbracket B^{-1} \rrbracket = \llbracket A (B S_{q,q'}) \rrbracket \cdot \tilde{g}' \cdot \llbracket (B S_{q,q'})^{-1} \rrbracket \qquad (2)$$
where $g'$ is the gate $g$ executed on qubit $q'$. In other words, as long as $A$ has shape (1) up to a permutation of the columns, one can still apply gate $g$ on the qubit $q'$ for which $A[:, q'] = e_q$.

Our goal now is to find a suitable operator $B$ such that $c$ is the smallest possible. We provide a heuristic to construct such a circuit.

4.2 Partial synthesis routine

In order to simplify the description of our heuristic, we can first remark that the shape (1) that we would like to achieve is stable under inversion. That is, finding $B$ such that $AB$ has shape (1) is equivalent to finding $B^{-1}$ such that $B^{-1} A^{-1}$ has shape (1). So instead of working on the columns of $A$ we can work on the rows of $A^{-1}$ and directly compute a quantum circuit for $B^{-1}$. Notably, due to Eq. (2), the freedom we have in the choice of the column for reducing $A$ to shape (1) is now a freedom in the choice of the row of $A^{-1}$.

Given some incoming 1-qubit gate acting on qubit $q$, our heuristic works in two stages:
• We start by setting one row of $A^{-1}$ to $e_q^T$. By definition of the inverse, the $q$-th row of $A$ produces a bit vector describing which wires of the circuit should be folded using a fan-in CNOT (i.e. a cascade of CNOT gates that share the same target) onto one of them in order to produce $e_q^T$ on $A^{-1}$.
By Eq. (2) we can choose any of the wires $q'$ for which $A[q, q'] = 1$.
• After choosing a suitable qubit $q'$ and updating $A^{-1}$ accordingly, the $q$-th column of the operator can be zeroed by distributing the $q'$-th row onto every row containing a nonzero $q$-th component. This can be achieved using a single fan-out CNOT (i.e. a cascade of CNOT gates sharing the same control).

Hence, we simply need to be able to produce implementations of fan-in and fan-out CNOT gates that are compliant with our connectivity graph.

To perform this synthesis we use a relaxed version of the method described in [KvdG19]. The idea is the following:
• compute $y = e_q^T \cdot A$, identified with the set $\{y_1, ..., y_k\}$ of its nonzero positions,
• compute a Steiner tree of the connectivity graph $G$, with terminal nodes $\{y_1, ..., y_k\}$,
• pick a terminal node $y_i$ and perform Algorithm 2. This routine is a straightforward generalization of the nearest-neighbor implementation of a CNOT gate proposed in [KMS07] (cf. their Figure 1), relaxed to leave intermediate wires in arbitrary states. It acts by pruning leaves of the tree while preserving the invariant that the leaves of the tree must be considered as control qubits for the rest of the fan-in synthesis. All CNOT gates used in the circuit are compliant with the tree's connectivity, making the circuit compliant with the qubit connectivity. Figure 3 gives an example of execution of this routine.

Algorithm 2
Fan-in along a tree
procedure FanIn(T, y, root)
    c_out ← ε
    while |T| > 1 do
        v ← a leaf of T that is not root
        u ← the only neighbor of v
        if u ∉ y then
            c_out ← c_out :: CNOT(u, v)
        end if
        c_out ← c_out :: CNOT(v, u)
        T.remove(v)
    end while
    return c_out
end procedure

Notice that intermediate wires may be left in a different state. Our only goal is to produce the correct parity $e_q$ on the root wire, and we take the liberty of freely changing the state of the intermediate wires. The resulting circuit contains $2(l - 1) - k$ CNOTs, where $l$ is the size of the tree and $k$ is the number of terminal vertices (i.e. the Hamming weight of $y$), including the root of the tree.

Fan-outs are synthesized in a similar fashion, except that terminal vertices are found by looking at the rows of the updated operator $A'$ that have a non-zero $q$-th component, and Algorithm 3 is used to produce a circuit.

Algorithm 3
Fan-out along a tree
procedure FanOut(T, y, root)
    c_out ← ε
    Ones ← y
    T′ ← a copy of T
    while |T′| > 1 do                ▷ Setting all the vertices of T to 1
        v ← a leaf of T′
        u ← the only neighbor of v
        if u ∉ Ones then
            c_out ← c_out :: CNOT(v, u)
            Ones.insert(u)
        end if
        T′.remove(v)
    end while
    while |T| > 1 do                 ▷ Getting rid of all 1s (except for root)
        v ← a leaf of T that is not root
        u ← the only neighbor of v
        c_out ← c_out :: CNOT(u, v)
        T.remove(v)
    end while
    return c_out
end procedure

This algorithm corresponds exactly to the fill-tree/empty-tree routine of [KvdG19], except that we work on the full hardware graph, and never have to restrict the structure of the Steiner tree to a "descending" tree. This approach only works because we heavily rely on the fact that we are synthesizing a single row/column and thus allow ourselves to leave intermediate wires in arbitrary states.

Both of these routines are quite close to the ones used in [KvdG19], except that we allow ourselves to be sloppier in the process, leaving any intermediate qubit in a dirty state instead of having to preserve invariants when implementing the fan-ins/fan-outs.
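The fan-in routine can be sketched in Python as follows. This is an illustrative reading of Algorithm 2, not a reference implementation: the names `fan_in` and `rows_after` are hypothetical, the tree is given as an adjacency dictionary, the root is assumed to be a terminal, and the Steiner tree itself is assumed to be computed elsewhere. One deliberate addition is labeled in the code: once a leaf has been folded onto a non-terminal vertex, that vertex is recorded as carrying a parity (mirroring the `Ones.insert(u)` step of Algorithm 3), so that a non-terminal vertex with several children accumulates all of them. The helper `rows_after` replays the CNOTs as row operations to check which parity ends up on the root wire.

```python
def fan_in(tree, terminals, root):
    """Return a list of (control, target) CNOTs folding all terminals onto root.

    tree: {vertex: set(neighbors)} describing a tree; root must be a terminal.
    """
    tree = {v: set(nb) for v, nb in tree.items()}    # work on a copy
    carrying = set(terminals)                         # wires holding a useful parity
    cnots = []
    while len(tree) > 1:
        # a tree with at least two vertices always has a leaf other than root
        v = next(x for x in tree if x != root and len(tree[x]) == 1)
        (u,) = tree[v]                                # the only neighbor of v
        if v in carrying:
            if u not in carrying:
                cnots.append((u, v))                  # cancels u's own content below
                carrying.add(u)                       # assumption of this sketch
            cnots.append((v, u))                      # fold v onto u
        tree[u].discard(v)                            # prune the leaf
        del tree[v]
    return cnots

def rows_after(cnots, n):
    """Parity table obtained by applying the CNOTs to the identity table."""
    rows = [[int(i == j) for j in range(n)] for i in range(n)]
    for c, t in cnots:
        rows[t] = [a ^ b for a, b in zip(rows[t], rows[c])]
    return rows

# Tree with edges 0-1, 1-2, 1-3, 3-4; vertices 1 and 3 are intermediate.
tree = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
cnots = fan_in(tree, terminals={0, 2, 4}, root=4)
root_row = rows_after(cnots, 5)[4]
```

On this example the root wire ends with the parity of wires 0, 2 and 4, while the intermediate wires 1 and 3 are left in modified states, which is the relaxation the text describes. Every emitted CNOT acts along an edge of the tree.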
In practice we improve the algorithm using two independent optimizations.
Dealing with phase gates.
It is unnecessary to zero a column of our current linear operator ifwe just need to insert a phase gate (i.e a diagonal gate). Indeed, since the gate is diagonal, andassuming it is executed on qubit q , it is well-known that the gate commutes with any CNOTwhose target is not q . So the diagonal gate will commute with the subsequent fan-out becauseone can check that the CNOT gates of the fan-out only use the qubit on which the diagonalgate is executed as a control. Hence, this fan-out can be omitted, thus approximately halvingthe number of required CNOT gates. Recursive search of finite depth.
As mentioned at the end of Section 4.1, we can synthesize our operator $B$ up to some column permutation. This gives us some freedom to perform some optimizations when picking the qubit that will effectively receive the incoming 1-qubit gate. To leverage this freedom, we can adopt the same strategy as in the SWAP insertion algorithm of Hirata et al. In practice, given an incoming gate $g$ acting on qubit $q$, we:
• compute the set $y$ of rows of $A^{-1}$ that need to interact in the fan-in CNOT,
• generate a Steiner tree with terminal vertices $y$,
• branch over all choices of $y_i \in y$ to receive gate $g$.

Notice that this boils down to trying all possible terminal vertices as root vertices in Algorithm 2. We then perform a recursive search as described in Appendix B.

Overall, including a recursive search of depth $w$, the worst-case time complexity of our algorithm grows as $O(m n^{w})$, where $m$ is the number of 1-qubit gates and $n$ the number of qubits in the target architecture. Notice that the runtime is linear in the input circuit's size, but grows exponentially in the depth of the recursive search.

Figure 3: Example of a tree and the corresponding fan-in CNOT circuit generated by Algorithm 2. The terminal vertices are circled. Intermediate vertices are represented as •. Notice that this routine can be improved in order to reduce the depth of the fan-in gate. In this work we decided to focus on CNOT count and thus did not insist on these lower-level optimizations.

Dealing with the final operator.
In the general case, the final linear operator in our classical data structure is not trivial. In a general compilation setting, this is not much of an issue, for two reasons:
• in the setting where we might have a follow-up circuit to compile, one can initialize the linear operator for the next compilation round to the final operator of the previous round,
• if we just finished compiling the final portion of our full quantum algorithm, one can always fix the sampled data in order to classically emulate the final linear operator. This operation boils down to inverting a simple linear system over $\mathbb{F}_2$.

Moreover, in most NISQ applications, the sampling directives executed at the end of a quantum circuit are there to estimate the expected value of some Hermitian operator $H$. Most of the time, this operator is specified in the Pauli basis. Thus, it is enough to compute a new Hermitian operator $A^{-1} H A$ such that sampling this operator at the end of the compiled circuit is equivalent to sampling the original operator at the end of the input circuit, and this new operator has the same number of terms as the original operator:
$$\langle 0 | C_{in}^{\dagger} H C_{in} | 0 \rangle = \langle 0 | C_{out}^{\dagger} \left( A^{-1} H A \right) C_{out} | 0 \rangle$$
In fact, this property is true for a larger subgroup: the Clifford group, which is tackled in the following section. The fixing procedures for the sampling and observable cases are detailed in Appendix A in the more general case of Clifford operators.

Generalization to routing via lazy synthesis of Clifford operators
We now further extend the previous approaches to lazy synthesis of elements of the Cliffordgroup.
The Clifford group, $C_n$, is a natural extension of the class of reversible linear Boolean operators. This group is defined as the largest subgroup of the unitary group that stabilizes the group of Pauli operators $P_n$:
$$C_n = \{ U \in U(2^n),\ \forall P \in P_n,\ U^{\dagger} P U \in P_n \} \qquad (3)$$
Given a Pauli operator $P \in P_n$ and a real angle $\theta \in \mathbb{R}$, we define the Pauli rotation $R_P(\theta)$ as:
$$R_P(\theta) = \cos(\theta/2)\, I - i \sin(\theta/2)\, P$$
The conjugation property (3) also applies to Pauli rotations, and not only to Pauli operators. Hence, for any Pauli rotation of axis $P \in P_n$, any angle $\theta \in \mathbb{R}$, and any $U \in C_n$:
$$U^{\dagger} R_P(\theta) U = R_{U^{\dagger} P U}(\theta) = R_{P'}(s \cdot \theta)$$
for some Pauli operator $P'$ and some sign $s = \pm 1$. Equivalently,
$$R_P(\theta)\, U = U\, R_{P'}(s \cdot \theta).$$
In fact, this relation can be used to normalize quantum circuits as sequences of non-Clifford Pauli rotations (i.e. Pauli rotations with angles $\neq k\pi/2$) followed by a final Clifford operator. Clifford operators can be represented efficiently by tableaux that specify how they act by conjugation over generators of the Pauli group [AG04, dB11]. In practice, this means that we can implement a data structure $T$ (a tableau), representing a Clifford operator in $C_n$, that:
• can be easily updated as $T \leftarrow \tilde{g} \cdot T$ or $T \leftarrow T \cdot \tilde{g}$ for some Clifford gate $g$,
• can be used to efficiently compute $P \mapsto T P T^{\dagger}$ for some $n$-qubit Pauli operator $P$, yielding another Pauli operator (and potentially a phase $\pm 1$).

In the following, we define the support of a Pauli operator $P$ as the set of qubits on which $P$ acts non-trivially. E.g. if $P = I \otimes Z \otimes X \otimes I$, then the support of $P$ is the set $\{1, 2\}$, since $P$ acts as the identity on qubits 0 and 3. For ease of notation we will drop the $\otimes$ operators.

Remark. In the following subsection, we will use the following simple structure to implement a Pauli rotation $R_P(\theta)$. We can first reduce $P$ to a diagonal operator by conjugating it through a circuit composed of local Clifford gates.
This circuit can be built by individually diagonalizing each component of the Pauli operator:
• if the operator acts as $X$ on qubit $i$, insert an $H$ gate on qubit $i$,
• if it acts as $Y$ on qubit $i$, insert a $\sqrt{X} = R_X(\pi/2)$ on qubit $i$.

The resulting Pauli operator acts either as $Z$ or $I$ on each qubit. Using the identity $CNOT \cdot ZZ \cdot CNOT = IZ$, one can reduce the support of $P$ to a single qubit via conjugation by a circuit composed of $|P| - 1$ CNOT gates targeting a single qubit $q$. In the end, the resulting Clifford circuit $C$ verifies $R_P(\theta) = C^{\dagger} R_{Z_q}(\theta)\, C$. An example is given in Figure 4. This reduction can be easily extended to take the architecture into account by performing a fan-in CNOT along a Steiner tree with the support of the rotation as terminal vertices.

Figure 4: Reduction of a Pauli operator/rotation. (a) The initial Pauli operator $XYZI$. (b) After conjugation via local Cliffords, our operator is diagonal. (c) After conjugation with the appropriate CNOT gates, our operator is localized on a single qubit (here, the first qubit). (d) The final quantum circuit implementing $R_{XYZI}(\theta)$.

In that setting we will consider that $G_{in}$ contains only Clifford gates and arbitrary Pauli rotations $R_P$ for $P \in P_n$. $G_{out}$ will contain $CNOT$, $H$, $R_X(\pi/2)$ gates and $R_Z$ rotations, the CNOTs being restricted to some interaction graph $G$.

In order to use our meta-heuristic, we need to specify our full data structure $D = \langle \mathcal{T}, \llbracket . \rrbracket, S, u, e \rangle$:
• $\mathcal{T}$ is the set of Clifford operators, or, to be precise, of tableaux representing Clifford operators,
• $\llbracket . \rrbracket$ is the standard tableau interpretation,
• $S$ is the set of Clifford gates,
• $u$ is the update of a tableau using a Clifford gate by left composition: $u(T, g) = \tilde{g} \cdot T$.

Our extraction function $e$ acts as follows. Upon encountering a non-Clifford Pauli rotation $R_P(\theta)$:
(i) Compute a Pauli operator $P'$ and a phase $s = \pm 1$ such that $s \cdot P' = T^{\dagger} P T$.
(ii) For each qubit $i$ in the support of $P'$, if $P'[i] = Y$ then perform an $R_X(\pi/2)$ gate on $i$, and if $P'[i] = X$, perform an $H$ on $i$. This produces a Clifford circuit $c$, comprising only local gates. E.g. for $P' = IXYZI$, we produce the circuit $c = H_1 :: R_X(\pi/2)_2$.
(iii) Pick a target qubit $q$ in the support of $P'$, and perform Algorithm 2 in order to generate a fan-in CNOT from all qubits in the support of $P'$ to $q$, thus updating the Clifford sub-circuit $c$.
(iv) Update $T$ by right composition with $\tilde{c}^{\dagger}$: $T' \leftarrow T \cdot \tilde{c}^{\dagger}$.
(v) Return the updated table $T'$ and the sub-circuit $c :: R_Z(s \cdot \theta)_q$.

The following proposition about $e$ holds:

Proposition 1.
Let $T$ be a tableau and $R_P(\theta)$ be a Pauli rotation. If $T', c = e(T, R_P(\theta))$, then
$$R_P(\theta) \cdot \llbracket T \rrbracket = \llbracket T' \rrbracket \cdot \tilde{c}.$$
Proof.
By construction, we have that $c = c_{prep} :: R_Z(s \cdot \theta)_q$, with $c_{prep}$ and $q$ such that
$$c_{prep} :: R_Z(s \cdot \theta)_q :: c_{prep}^{\dagger} = R_{P'}(s \cdot \theta),$$
where $s \cdot P' = T^{\dagger} P T$ and $c_{prep}$ is a Clifford circuit. This implies that $\tilde{c} = \tilde{c}_{prep} \cdot R_{P'}(s \cdot \theta)$.

To be precise, $c_{prep}$ holds the local basis changes and CNOT cascade necessary to the implementation of $R_{P'}(s \cdot \theta)$, plus some stray Clifford operators that might have been introduced during the "dirty" fan-in (corresponding to the dashed box in the example circuit).

[Example circuit: $\llbracket T \rrbracket$, $C$, $R_Z(s \cdot \theta)$, $C^{\dagger}$, with the stray Clifford operators marked by a dashed box.]

Hence, we have:
\begin{align*}
R_P(\theta) \cdot \llbracket T \rrbracket &= \llbracket T \rrbracket \cdot \llbracket T \rrbracket^{\dagger} \cdot R_P(\theta) \cdot \llbracket T \rrbracket \\
&= \llbracket T \rrbracket \cdot \left( \llbracket T \rrbracket^{\dagger} \cdot R_P(\theta) \cdot \llbracket T \rrbracket \right) \\
&= \llbracket T \rrbracket \cdot R_{T^{\dagger} P T}(\theta) \\
&= \llbracket T \rrbracket \cdot R_{P'}(s \cdot \theta) \quad \text{where } s \cdot P' = T^{\dagger} P T \\
&= \llbracket T \rrbracket \cdot \tilde{c}_{prep}^{\dagger} \cdot R_{Z_q}(s \cdot \theta) \cdot \tilde{c}_{prep} \\
&= \llbracket T \cdot \tilde{c}_{prep}^{\dagger} \rrbracket \cdot \tilde{c} \\
&= \llbracket T' \rrbracket \cdot \tilde{c}
\end{align*}

In the end, our final output circuit will always have shape
$$C_{out} = C \prod_i R_{Z_{q_i}}(\theta_i)\, F_i\, L_i$$
where $C$ is some Clifford operator, $R_{Z_{q_i}}(\theta_i)$ are non-Clifford local $Z$ rotations, $F_i$ are architecture-compliant fan-in CNOTs as described by Algorithm 2, $L_i$ are local Clifford circuits, and $q_i$ are the target qubits used in the Pauli rotation reductions.

Recursive search of finite depth.
Notice that, once again, we have some freedom of choicewhen picking the qubit that will receive the R Z rotation. In practice, we perform a recursivesearch of finite depth for the next w rotations to synthesize and pick the host qubit that leads tothe least overhead. The branching is very similar to the one described in 4.3. After computingthe Steiner tree with terminal vertices the support of the rotation we are currently synthesizing,one can choose any terminal vertex to be the target of our fan-in. Once again we refer toAppendix B for more details. The overall worst case complexity is the same as the CNOT case.Indeed, the complexity is dominated by the recursive exploration of a search tree where eachvertex exploration requires the generation of a Steiner tree of the architecture graph. Dealing with the final Clifford operator.
Once again, we are left with a possibly non-trivial final Clifford operator $C$. As stated in the previous section, if one has to compile several pieces of circuits in sequence, one can always initialize the Clifford operator of the next compilation round using $C$. In the general case where we are done compiling and need to effectively deal with this operator, we can almost always avoid having to synthesize the full operator $C$. Appendix A describes how to do so when sampling an observable or sampling bit-strings in the computational basis.

Rotation merging.
As mentioned in subsection 5.1, any quantum circuit can be reformulated as a sequence of Pauli rotations with non-Clifford angles (i.e. angles $\neq k\frac{\pi}{2}$). That is:

$$C \prod_i R_{P_i}(\theta_i)$$

where $R_{P_i}(\theta_i)$ are Pauli rotations and $C$ is a final Clifford operator. Moreover, this form can be efficiently computed by pulling all the Clifford gates to the end of the circuit. Once such a product is obtained, one can try to merge rotations with identical axes. This can also be done efficiently by considering each rotation one by one and checking whether it can be commuted and merged with a rotation with an identical axis. This routine is described in Algorithm 4. Notice that this is not the only way to produce a final ordering of the rotations. In particular, when we insert the un-merged rotation in list $L$ (line 14), one could make a different choice and insert it sooner in the list. By inserting it at the end of the list, we might block some other merges by preventing the next rotations from commuting past it. In order to keep this optimization lightweight and reproducible, we keep things simple and insert the rotation at the end of the list.

This optimization has several consequences. First, by merging rotations, we reduce the number of calls to the partial synthesis routine. Moreover, by merging rotations, one might end up with a rotation with a Clifford angle. Such a rotation can then be pulled to the end of the circuit, effectively removing it from the sequence of rotations to synthesize. This optimization is a key feature when dealing with Clifford + T circuits, where this type of situation occurs regularly. This pre-processing has a worst case time complexity of $O(m^2 n)$ where $m$ is the number of non-Clifford Pauli rotations and $n$ is the number of qubits.
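The merging pass described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: Pauli axes are encoded as strings over `I`, `X`, `Y`, `Z` of equal length, signs are ignored, and the names `paulis_commute` and `merge_rotations` are hypothetical. Un-merged rotations are appended at the end of the list, matching the simple choice made in the text.

```python
def paulis_commute(p: str, q: str) -> bool:
    """Return True if two Pauli strings (e.g. "XIZ") commute.

    Two Pauli strings commute iff the number of positions where both
    are non-identity and different is even.
    """
    clashes = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return clashes % 2 == 0


def merge_rotations(rotations):
    """Merge rotations sharing the same axis when commutation allows it.

    `rotations` is a list of (pauli_string, angle) pairs in circuit order.
    Each new rotation is commuted backwards through the list built so far:
    it is merged into the first rotation with an identical axis, and the
    backward scan stops at the first rotation it does not commute with.
    """
    merged = []  # list of [pauli_string, angle], mutable so angles can be updated
    for pauli, angle in rotations:
        was_merged = False
        for entry in reversed(merged):
            if entry[0] == pauli:
                entry[1] += angle  # identical axis: add the angles
                was_merged = True
                break
            if not paulis_commute(pauli, entry[0]):
                break  # blocked: cannot commute further back
        if not was_merged:
            merged.append([pauli, angle])  # insert at the end of the list
    return [(p, a) for p, a in merged]


if __name__ == "__main__":
    # ZZ and XX commute, so the two ZZ rotations are merged.
    print(merge_rotations([("ZZ", 0.25), ("XX", 0.5), ("ZZ", 0.5)]))
    # ZI and XI anticommute, so the second ZI rotation is blocked.
    print(merge_rotations([("ZI", 0.25), ("XI", 0.5), ("ZI", 0.25)]))
```

A follow-up pass (not shown) would detect merged rotations whose angle has become a Clifford angle and pull them into the final Clifford operator.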
Algorithm 4 Rotation merging

procedure MergeRotations(S)
    L ← [ ]
    for R_P(θ) in S do
        for R_P'(θ') in reversed(L) do
            if P = P' then
                θ' ← θ' + θ
                break
            end if
            if P and P' do not commute then
                break
            end if
        end for
        if R_P(θ) was not merged then
            L ← L :: R_P(θ)
        end if
    end for
    return L
end procedure

Rotation reordering.
Another optimization that can easily be computed is the reordering of consecutive commuting rotations. Given a sequence of Pauli rotations $\prod_i R_{P_i}(\theta_i)$, one can rewrite it as $\prod_G \prod_{i \in G} R_{P_i}(\theta_i)$ where the $G$ are groups of commuting rotations. Notice that in this expression, while the first product is ordered, the second is not. This gives us some leverage for optimization. In practice, we use a greedy approach that synthesizes the least costly rotation first. That is, we compute all the Steiner trees necessary to implement all the rotations in a given group and start with the rotation that requires the smallest tree. Groups of commuting rotations are computed greedily using Algorithm 5. Notice that this is not the only way to produce such a sequence. In practice, trying harder to form larger groups of commuting rotations did not seem to improve the benchmark results, hence the rather simple greedy heuristic. This pre-processing has a worst case time complexity of $O(m^2 n)$ where $m$ is the number of non-Clifford Pauli rotations and $n$ is the number of qubits.

Algorithm 5
Rotation grouping

procedure GroupRotations(S)
    L ← [ ]
    G ← {}
    for R_P(θ) in S do
        if R_P commutes with all rotations in G then
            G.insert(R_P(θ))
            continue
        end if
        L ← L :: G
        G ← { R_P(θ) }
    end for
    L ← L :: G
    return L
end procedure

Benchmarks
In order to benchmark our method we picked three representative architectures: Rigetti's Aspen chip (16 qubits), IBM's Melbourne chip (14 qubits), and a fictitious all-to-all (14 qubits) architecture. The idea is that Melbourne's connectivity is close to a grid, whereas Aspen's connectivity contains longer cycles and has a less regular structure. The all-to-all architecture acts as a baseline in the benchmarks. The connectivity graphs are described in Figure 5.

Figure 5: (a) IBM's Melbourne (qubits 0 to 13) and (b) Rigetti's Aspen (qubits 0 to 15) connectivity graphs.
We benchmarked the following algorithms and variants:

• Hirata et al. SWAP insertion algorithm (generalized to arbitrary connectivity, search depth of 4), denoted swap in the various benchmarks,
• lazy synthesis using linear boolean operators (depth of 3), denoted linear in the benchmarks,
• lazy synthesis using Clifford operators (depth of 3), denoted clifford in the benchmarks,
• lazy synthesis using Clifford operators (depth of 3) with the additional reordering of Pauli rotations, denoted clifford⋆ in the benchmarks,
• lazy synthesis using Clifford operators (depth of 3) with the additional merging of Pauli rotations, denoted clifford† in the benchmarks,
• lazy synthesis using Clifford operators (depth of 3) with the additional merging and reordering of Pauli rotations, denoted clifford⋆† in the benchmarks,

on four sets of quantum circuits:

• A set of random circuits parameterized by their Clifford gate density (see Figure 6). This parameterization helps predict the performances of our methods when applied to other families of circuits. See below for a description of the random generation process.
• A collection of standard circuits taken from [AAM18] that fit on 14 qubits. Circuits are simply pre-processed by replacing Toffoli gates by a standard CNOT + T decomposition. Tables 1, 2, 3 provide the final CNOT counts and the relative CNOT overhead for the three hardware models.
• A set of random QAOA instances of MAX-k-LIN-2 (depth 1). These circuits are basically phase polynomials with uniform Hamming weights equal to $k$. The circuits are generated using a naive strategy and produce a large amount of CNOTs. Their Clifford density roughly grows as $\frac{2k-2}{2k-1}$ (neglecting the final layer of non-Clifford $X$ rotations and the initial Walsh-Hadamard transform).
• A set of random products of arbitrary Pauli rotations. These present roughly the same statistical features as standard quantum chemistry/material Ansätze. These circuits usually exhibit quite large Clifford densities.

Why no other SWAP insertion algorithms?
We also tried to include other SWAP insertion algorithms (namely SABRE [LDX18], and an A* based approach [ZPW17]), but both of these methods performed systematically worse than the approach of Hirata et al. generalized to arbitrary connectivity (the algorithm described in Section 3). Moreover, the execution time of the A* approach can sometimes become prohibitive, which makes it impractical for realistic applications.

Random generation process.
Our random circuit generation process is parameterized by a number of qubits $n$ and a Clifford density parameter $p$. Each circuit contains n gates. For each gate, with probability $1 - p$, we insert a non-Clifford $Z$ rotation on a random qubit. Otherwise, with probability $\frac{1}{2}$, we insert a random CNOT, else a random 1-qubit Clifford gate. These circuits have roughly the same number of CNOTs and 1-qubit Cliffords as naively implemented VQE/QAOA Ansätze and are therefore representative of typical variational quantum circuits.

Figure 6: Benchmarks for random circuits. Circuits are randomly generated using all the n qubits of the target architecture, n gates and a fixed Clifford density (see Section 6). Each point is generated by compiling 100 random circuits.

Random circuits.
The simplest set of benchmarks to explain is the one over random circuits. When increasing the Clifford density, the average number of entangling gates grows, leading to a linearly growing overhead for the SWAP insertion approach. The approach based on linear boolean operators eventually outperforms the SWAP insertion technique when the Clifford density becomes large, due to the increased proportion of CNOT gates. This approach still requires a synthesis every time a non-CNOT gate is met, hence the large overhead. As expected, the Clifford based approach quickly outperforms both other approaches when Clifford gates become predominant in the circuit, since it has fewer and fewer operators to synthesize. Notice how this approach ultimately achieves a compression (i.e. a negative CNOT overhead). The qualitative behavior is comparable for the Melbourne and Aspen connectivities. For the all-to-all connectivity, the SWAP insertion is trivial, hence omitted. Interestingly, the Clifford approach is significantly worse than doing nothing in that case, up until the Clifford density is larger than about 85%, at which point it starts compressing the circuit. Our rotation merging pre-optimization strictly outperforms the SWAP insertion approach in the two constrained architectures (see clifford† and clifford⋆†).

Table 1: Compilation of a collection of standard circuits for Melbourne architecture.
Each cell gives the relative CNOT overhead followed by the final CNOT count.

| circuit | init | swap | linear | clifford | clifford⋆ | clifford† | clifford⋆† |
|---|---|---|---|---|---|---|---|
| tof_3 | 18 | 116.7% 39 | 72.2% 31 | 50.0% 27 | 61.1% 29 | 0.0% 18 | 11.1% 20 |
| barenco_tof_3 | 24 | 75.0% 42 | 25.0% 30 | 4.2% 25 | 8.3% 26 | -41.7% 14 | -41.7% 14 |
| mod5_4 | 28 | 117.9% 61 | 35.7% 38 | 7.1% 30 | -28.6% 20 | -25.0% 21 | -35.7% 18 |
| tof_4 | 30 | 110.0% 63 | 76.7% 53 | 33.3% 40 | 56.7% 47 | -3.3% 29 | 0.0% 30 |
| tof_5 | 42 | 135.7% 99 | 200.0% 126 | 176.2% 116 | 83.3% 77 | 119.0% 92 | 64.3% 69 |
| qft_4 | 46 | 176.1% 127 | 45.7% 67 | 26.1% 58 | 4.3% 48 | -30.4% 32 | -32.6% 31 |
| barenco_tof_4 | 48 | 112.5% 102 | 131.2% 111 | 29.2% 62 | 70.8% 82 | -31.2% 33 | -20.8% 38 |
| mod_mult_55 | 48 | 337.5% 210 | 341.7% 212 | 202.1% 145 | 141.7% 116 | 145.8% 118 | 131.2% 111 |
| vbe_adder_3 | 70 | 107.1% 145 | 30.0% 91 | -11.4% 62 | -32.9% 47 | -55.7% 31 | -52.9% 33 |
| barenco_tof_5 | 72 | 112.5% 153 | 123.6% 161 | 134.7% 169 | 61.1% 116 | 6.9% 77 | 5.6% 76 |
| rc_adder_6 | 93 | 167.7% 249 | 48.4% 138 | 45.2% 135 | 30.1% 121 | -14.0% 80 | -4.3% 89 |
| gf2^4_mult | 99 | 209.1% 306 | 263.6% 360 | 202.0% 299 | 63.6% 162 | 96.0% 194 | 22.2% 121 |
| mod_red_21 | 105 | 185.7% 300 | 150.5% 263 | 112.4% 223 | 81.0% 190 | 49.5% 157 | 24.8% 131 |
| hwb6 | 116 | 196.6% 344 | 144.8% 284 | 62.1% 188 | 61.2% 187 | 23.3% 143 | 18.1% 137 |
| grover_5 | 288 | 116.7% 624 | 158.7% 745 | 207.3% 885 | 174.7% 791 | 33.3% 384 | 6.6% 307 |
| hwb8 | 7129 | 227.0% 23311 | 126.5% 16144 | 154.0% 18110 | 127.9% 16246 | 77.5% 12654 | 76.4% 12574 |

Table 2: Compilation of a collection of standard circuits for Aspen architecture.
| circuit | init | swap | linear | clifford | clifford⋆ | clifford† | clifford⋆† |
|---|---|---|---|---|---|---|---|
| tof_3 | 18 | 116.7% 39 | 72.2% 31 | 50.0% 27 | 61.1% 29 | 0.0% 18 | 11.1% 20 |
| barenco_tof_3 | 24 | 75.0% 42 | 25.0% 30 | 4.2% 25 | 8.3% 26 | -41.7% 14 | -41.7% 14 |
| mod5_4 | 28 | 117.9% 61 | 35.7% 38 | 7.1% 30 | -28.6% 20 | -25.0% 21 | -35.7% 18 |
| tof_4 | 30 | 110.0% 63 | 76.7% 53 | 33.3% 40 | 56.7% 47 | -3.3% 29 | 0.0% 30 |
| tof_5 | 42 | 171.4% 114 | 159.5% 109 | 119.0% 92 | 157.1% 108 | 54.8% 65 | 81.0% 76 |
| qft_4 | 46 | 176.1% 127 | 45.7% 67 | 26.1% 58 | 4.3% 48 | -30.4% 32 | -32.6% 31 |
| barenco_tof_4 | 48 | 112.5% 102 | 131.2% 111 | 29.2% 62 | 58.3% 76 | -31.2% 33 | -20.8% 38 |
| mod_mult_55 | 48 | 218.8% 153 | 331.2% 207 | 133.3% 112 | 64.6% 79 | 52.1% 73 | 60.4% 77 |
| vbe_adder_3 | 70 | 145.7% 172 | 67.1% 117 | 32.9% 93 | 25.7% 88 | -30.0% 49 | -27.1% 51 |
| barenco_tof_5 | 72 | 137.5% 171 | 287.5% 279 | 152.8% 182 | 152.8% 182 | 45.8% 105 | 41.7% 102 |
| rc_adder_6 | 93 | 190.3% 270 | 153.8% 236 | 80.6% 168 | 38.7% 129 | 37.6% 128 | 79.6% 167 |
| gf2^4_mult | 99 | 254.5% 351 | 367.7% 463 | 284.8% 381 | 119.2% 217 | 139.4% 237 | 72.7% 171 |
| mod_red_21 | 105 | 174.3% 288 | 236.2% 353 | 125.7% 237 | 108.6% 219 | 57.1% 165 | 59.0% 167 |
| hwb6 | 116 | 178.4% 323 | 143.1% 282 | 66.4% 193 | 47.4% 171 | 13.8% 132 | 4.3% 121 |
| grover_5 | 288 | 119.8% 633 | 228.1% 945 | 186.5% 825 | 110.4% 606 | 25.3% 361 | 7.3% 309 |
| hwb8 | 7129 | 206.2% 21829 | 180.1% 19970 | 203.6% 21642 | 160.7% 18585 | 65.5% 11798 | 60.3% 11430 |

| circuit | init | linear | clifford | clifford⋆ | clifford† | clifford⋆† |
|---|---|---|---|---|---|---|
| tof_3 | 18 | 0.0% 18 | -33.3% 12 | -44.4% 10 | -61.1% 7 | -61.1% 7 |
| barenco_tof_3 | 24 | -8.3% 22 | -12.5% 21 | -41.7% 14 | -50.0% 12 | -54.2% 11 |
| mod5_4 | 28 | -10.7% 25 | -35.7% 18 | -46.4% 15 | -57.1% 12 | -57.1% 12 |
| tof_4 | 30 | -3.3% 29 | -13.3% 26 | -30.0% 21 | -43.3% 17 | -53.3% 14 |
| tof_5 | 42 | -2.4% 41 | -14.3% 36 | -33.3% 28 | -38.1% 26 | -33.3% 28 |
| qft_4 | 46 | -32.6% 31 | -28.3% 33 | -41.3% 27 | -52.2% 22 | -58.7% 19 |
| barenco_tof_4 | 48 | -4.2% 46 | -2.1% 47 | -22.9% 37 | -41.7% 28 | -47.9% 25 |
| mod_mult_55 | 48 | -6.2% 45 | -14.6% 41 | -25.0% 36 | -47.9% 25 | -39.6% 29 |
| vbe_adder_3 | 70 | -18.6% 57 | -21.4% 55 | -51.4% 34 | -70.0% 21 | -65.7% 24 |
| barenco_tof_5 | 72 | -8.3% 66 | -4.2% 69 | -27.8% 52 | -37.5% 45 | -48.6% 37 |
| rc_adder_6 | 93 | -4.3% 89 | -2.2% 91 | -15.1% 79 | -28.0% 67 | -14.0% 80 |
| gf2^4_mult | 99 | 23.2% 122 | 35.4% 134 | -20.2% 79 | -35.4% 64 | -39.4% 60 |
| mod_red_21 | 105 | 0.0% 105 | 37.1% 144 | -21.9% 82 | -7.6% 97 | -24.8% 79 |
| hwb6 | 116 | -2.6% 113 | 8.6% 126 | -16.4% 97 | -36.2% 74 | -35.3% 75 |
| grover_5 | 288 | -5.2% 273 | 10.1% 317 | 4.5% 301 | 8.7% 313 | 2.1% 294 |
| hwb8 | 7129 | -2.7% 6939 | 65.7% 11810 | 20.1% 8562 | 9.0% 7769 | 1.0% 7197 |

Standard circuits.
Unsurprisingly, the Clifford based approach almost systematically outperforms the other two approaches in the case of limited connectivity. Interestingly, for an all-to-all connectivity, the linear based approach seems to behave well and achieves a CNOT count reduction where the Clifford approach fails to (see grover_5 and hwb8). On this class of circuits, our pre-optimizations largely reduce the compilation overhead. Since these circuits are composed of Clifford and T gates, merging rotations is really beneficial. Most merges produce Clifford rotations that do not contribute to the CNOT overhead in the compiled circuit.
MAX-k-LIN-2 and random Pauli sequences.
For both of these benchmarks the input circuits have a quite high Clifford density, since they are based on an initial naive implementation of a sequence of Pauli rotations. It is interesting to notice that since the MAX-k-LIN-2 circuits roughly correspond to phase polynomials, the linear and Clifford approaches have very comparable behaviors. The Clifford approach, however, can benefit from the rotation reordering optimization. With this optimization, the Clifford approach becomes overwhelmingly better than the SWAP or linear approach. Notice that the first point(s) of the graph (Hamming weight of 2) correspond exactly to MAX-CUT QAOA circuits, which are often taken as standard circuits for NISQ era applications. On these circuits, our method with rotation reordering presents a twofold improvement compared to the SWAP insertion approach. In the random Pauli setting, the Clifford approach, by itself, systematically beats the two other approaches. The rotation reordering optimization does not bring significant improvements compared to the standard Clifford approach.
Beyond Clifford.
It is not clear how to extend this approach to groups larger than the Clifford group. It might be worth investigating the higher levels of the Clifford hierarchy and devising extraction routines for these operators, even though they do not exhibit the same properties as the Clifford group.
Gaussian operators.
A potential candidate is the group of Gaussian operators. This group corresponds to operators that can be implemented via circuits of matchgates. It has the nice feature of stabilizing the hierarchy of Hamiltonians that are bounded degree polynomials over a Clifford algebra [JM08]. It happens that these types of operators are the main ingredient used to construct UCCSD Ansätze for Fermionic dynamics (including VQE circuits for quantum chemistry or material science). For instance, in the quantum chemistry setting, this result entails that one can pull all single excitation terms of an Ansatz to the end of the Ansatz, and conjugate the final Hamiltonian with these terms. The resulting circuit will have a reduced number of terms to implement, but these terms might be harder to implement. Hence it is not obvious that one can gain anything by synthesizing these via a naive approach.
In the present work, we only developed algorithms that try to reduce the overall CNOT count of the output circuit. It is of course possible to change the metric to take into account more aspects of the final circuit. A good start would be to use finer hardware models and (roughly) compute the fidelity of each produced sub-circuit, picking the most faithful one. This simple approach has been shown to improve the overall circuit fidelity compared to straightforward gate count minimization in the SWAP insertion setting. Similarly, one can also aim at reducing the entangling depth instead of the entangling gate count.
In this work, we take a local (with finite depth search) approach to tackle the problem of (re-)synthesis of a sequence of Pauli rotations. It could be interesting to apply global techniques for the synthesis of groups of commuting rotations to solve this problem. This would probably lead to better results for standard Clifford + T circuits. It remains unclear whether these approaches can behave well for NISQ era circuits. A recent work by Gheorghiu et al. [GLMM20] tackles the problem by extracting and re-synthesizing phase polynomials out of the input circuit. This approach seems to perform quite well on some circuits and far worse than our method on others. For instance, Table 4 sums up the performances of their two splitting heuristics and our SWAP and Clifford based compilers on a 3 × [...]

| circuit | init | swap | clifford⋆† | CNOT-OPT-A | CNOT-OPT-B |
|---|---|---|---|---|---|
| tof_5 | 42 | 128.6% | 11.9% | 140.82% | 138.78% |
| mod_mult_55 | 48 | 168.8% | 14.6% | 321.82% | 203.64% |
| barenco_tof_5 | 72 | 125.0% | -36.1% | 245.24% | 140.48% |
| grover_5 | 288 | 108.3% | 3.8% | 116.67% | 105.36% |
We presented a meta-heuristic called lazy synthesis that exploits efficient representations of elements in a subgroup of the unitary group in order to compile an input quantum circuit into an architecture compliant circuit. We showed how this meta-heuristic can be used to reformulate a standard SWAP insertion algorithm from the literature and produced two new compilation algorithms based on the partial synthesis of linear Boolean operators and Clifford operators. Finally, we ran benchmarks on various classes of circuits, providing evidence that these algorithms are competitive in a NISQ setting.

While our algorithms seem to be well behaved on NISQ oriented quantum circuits, it remains unclear whether they scale to very large Clifford + T quantum circuits. It is very likely that their inherently local structure will hinder performances on large circuits.
Acknowledgments
This work was supported in part by the French National Research Agency (ANR) under the research project SoftQPRO ANR-17-CE25-0009-02, and by the DGE of the French Ministry of Industry under the research project PIAGDN/QuantEx P163746-484124.
References

[AAM18] Matthew Amy, Parsiad Azimzadeh, and Michele Mosca. On the controlled-not complexity of controlled-not–phase circuits. Quantum Science and Technology, 4(1):015002, 2018.
[AG04] Scott Aaronson and Daniel Gottesman. Improved simulation of stabilizer circuits. Physical Review A, 70(5), Nov 2004.
[AG20] Matthew Amy and Vlad Gheorghiu. staq—a full-stack quantum processing toolkit. Quantum Science and Technology, 5(3):034016, Jun 2020.
[BM20] Sergey Bravyi and Dmitri Maslov. Hadamard-free circuits expose the structure of the clifford group. arXiv preprint arXiv:2003.09412, 2020.
[CSU19] Andrew M. Childs, E. Schoute, and Cem M. Unsal. Circuit transformations for quantum architectures. ArXiv, abs/1902.09102, 2019.
[dB11] Niel de Beaudrap. A linearized stabilizer formalism for systems of finite dimension, 2011.
[dBBV+20] Timothée Goubault de Brugière, Marc Baboulin, Benoît Valiron, Simon Martiel, and Cyril Allouche. Quantum cnot circuits synthesis for nisq architectures using the syndrome decoding problem. In International Conference on Reversible Computation, pages 189–205. Springer, 2020.
[GLMM20] Vlad Gheorghiu, Sarah Meng Li, Michele Mosca, and Priyanka Mukhopadhyay. Reducing the cnot count for clifford+t circuits on nisq architectures, 2020.
[HNYN11] Yuichi Hirata, Masaki Nakanishi, Shigeru Yamashita, and Yasuhiko Nakashima. An efficient conversion of quantum circuits to a linear nearest neighbor architecture. Quantum Information & Computation, 11(1&2):142–166, 2011.
[JM08] Richard Jozsa and Akimasa Miyake. Matchgates and classical simulation of quantum circuits. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 464(2100):3089–3106, Jul 2008.
[KMS07] Samuel A Kutin, David Petrie Moulton, and Lawren M Smithline. Computation at a distance. arXiv preprint quant-ph/0701194, 2007.
[KvdG19] Aleks Kissinger and Arianne Meijer van de Griend. Cnot circuit extraction for topologically-constrained quantum memories, 2019.
[LDX18] Gushu Li, Yufei Ding, and Yuan Xie. Tackling the qubit mapping problem for nisq-era quantum devices, 2018.
[Lit19] Daniel Litinski. Magic state distillation: Not as costly as you think. Quantum, 3:205, Dec 2019.
[NGM20] Beatrice Nash, Vlad Gheorghiu, and Michele Mosca. Quantum circuit optimizations for nisq architectures. Quantum Science and Technology, 5(2):025010, Mar 2020.
[PMH08] Ketan N Patel, Igor L Markov, and John P Hayes. Optimal synthesis of linear reversible circuits. Quantum Information & Computation, 8(3):282–294, 2008.
[Pre18] John Preskill. Quantum computing in the nisq era and beyond. Quantum, 2:79, Aug 2018.
[SSP13] A. Shafaei, M. Saeedi, and M. Pedram. Optimization of quantum circuits for interaction distance in linear nearest neighbor architectures. In , pages 1–6, 2013.
[vdBT20] Ewout van den Berg and Kristan Temme. Circuit optimization of hamiltonian simulation by simultaneous diagonalization of pauli clusters. Quantum, 4:322, Sep 2020.
[vdGD20] Arianne Meijer van de Griend and Ross Duncan. Architecture-aware synthesis of phase polynomials for nisq devices, 2020.
[ZPW17] Alwin Zulehner, Alexandru Paler, and Robert Wille. An efficient methodology for mapping quantum circuits to the ibm qx architectures, 2017.
A Dealing with the final operator
In this section we detail how to classically emulate any final non-trivial Clifford operator. This encompasses the case of a final permutation or linear operator, even though the case of a final permutation can be trivially dealt with.
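To illustrate why the permutation case is trivial, here is a minimal Python sketch of the classical fix-up: the measured bits are simply relabeled. The encoding is an assumption made for this example only (`perm[i]` is the position to which the final permutation moves the content of qubit `i`), and the function name `apply_final_permutation` is hypothetical.

```python
def apply_final_permutation(bits, perm):
    """Relabel a measured bit-string according to a final qubit permutation.

    `bits[i]` is the outcome measured on qubit i of the compiled circuit.
    `perm[i]` is the position that the content of qubit i is moved to by
    the final permutation (assumed encoding, for illustration only).
    """
    if sorted(perm) != list(range(len(bits))):
        raise ValueError("perm must be a permutation of range(len(bits))")
    fixed = [0] * len(bits)
    for source, target in enumerate(perm):
        fixed[target] = bits[source]
    return fixed


if __name__ == "__main__":
    # Qubit 0 is moved to position 2, qubit 1 to position 0, qubit 2 to position 1.
    print(apply_final_permutation([1, 0, 0], [2, 0, 1]))
```

No quantum gate is needed for this step: the permutation is applied to the classical measurement record only.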
A.1 Expected value of some observable
In this setting, we assume that we are given as input both the circuit $C_{\mathrm{in}}$ to compile and some final observable $H$ to evaluate at the end of the circuit execution. In short, we need to compute:

$$\langle 0 | C_{\mathrm{in}}^{\dagger} H C_{\mathrm{in}} | 0 \rangle$$

Using either the linear operator synthesis approach or the Clifford approach, we end up producing a circuit $C_{\mathrm{out}}$ and a final linear/Clifford operator $A$ such that:

$$\langle 0 | C_{\mathrm{in}}^{\dagger} H C_{\mathrm{in}} | 0 \rangle = \langle 0 | C_{\mathrm{out}}^{\dagger} A^{\dagger} H A C_{\mathrm{out}} | 0 \rangle$$

Let us further assume that $H$ is given to us in the Pauli basis. That is:

$$H = \sum_i \alpha_i P_i$$

with $\alpha_i$ some real coefficients, and $P_i \in \mathcal{P}_n$ some Pauli operators. Then, sampling the new observable $A^{\dagger} H A = \sum_i \alpha_i A^{\dagger} P_i A = \sum_i \alpha_i P'_i$ on the output circuit is equivalent to sampling the input observable on the input circuit.

Sampling this observable using the standard techniques of co-diagonalization of its terms is no more costly (in terms of shots) than sampling the original $H$.

A.2 Sampling bit-strings
If we are required to provide some samples taken according to the final distribution induced by $C_{\mathrm{in}} |0\rangle$, things are a bit trickier.

Our algorithms output a pair $C_{\mathrm{out}}, A$ such that $A C_{\mathrm{out}} = C_{\mathrm{in}}$ and we would like to emulate the sampling of $A C_{\mathrm{out}} |0\rangle = C_{\mathrm{in}} |0\rangle$. To do so we proceed as follows.

Define $\mathcal{Z} = \{ Z_i, i \in [n] \}$, the set of local $Z$ operators on each qubit. Sampling bit-strings out of some quantum state over $n$ qubits boils down to iteratively evaluating the value of these operators in any order (since they commute). We would like to evaluate this collection of operators on state $A C_{\mathrm{out}} |0\rangle$. This is equivalent to evaluating the collection of Pauli operators $A^{\dagger} \mathcal{Z} A = \{ P_i = A^{\dagger} Z_i A, i \in [n] \}$ on state $C_{\mathrm{out}} |0\rangle$. These operators, however, might not be diagonal operators, and thus cannot be directly evaluated using standard computational basis measurements. Nevertheless, these operators commute with one another since the $Z_i$ commute. Hence, one can co-diagonalize them using a Clifford circuit $C_{\mathrm{diag}}$. By construction, the new collection of Pauli operators $\{ s_i Q_i = C_{\mathrm{diag}} P_i C_{\mathrm{diag}}^{\dagger}, i \in [n] \}$ are diagonal operators (hence products of $Z_j$ and $I_j$) times some phase $s_i = \pm 1$. Sampling these operators on state $C_{\mathrm{diag}} C_{\mathrm{out}} |0\rangle$ is, by construction, equivalent to evaluating operators in $\mathcal{Z}$ over state $A C_{\mathrm{out}} |0\rangle$. Figure 9 depicts this sequence of conjugations.

Since the $Q_i$ are products of $Z_j$ and $I_j$ operators, they can be seen as computing parities over a subset of qubits. This gives us a simple algorithm to fix measurement results. We can sample some bit-string $w$ out of the quantum state $C_{\mathrm{diag}} C_{\mathrm{out}} |0\rangle$ and output a new bit-string $w'$ with $w'_i = \delta_{-1, s_i} \oplus \sum_{j \in Q_i} w_j$ where the sum is modulo 2. This operation boils down to applying an affine system over $\mathbb{F}_2^n$ described by the $(s_i, Q_i)$ operators.

For example, let us assume that we need to sample bit-strings over 2 qubits, and that after conjugation through $A$ and co-diagonalization, we get operators $Q_1 = Z \otimes Z$, $s_1 = -1$, $Q_2 = Z \otimes I$, $s_2 = 1$. These operators can be summed up via the following affine system over $\mathbb{F}_2$:

$$x \mapsto Lx + b \quad \text{with} \quad L = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad b = (1, 0)^T.$$

Any bit-string sampled from state $C_{\mathrm{diag}} C_{\mathrm{out}} |0\rangle$ can be fixed by applying $L$ and adding $b$:

$$00 \mapsto 10, \quad 01 \mapsto 00, \quad 10 \mapsto 01, \quad 11 \mapsto 11.$$

Figure 9: The sampling fixing procedure. (a) We need to emulate sampling of the quantum state $A C_{\mathrm{out}} |0\rangle = C_{\mathrm{in}} |0\rangle$. This sampling procedure relies on the joint measurements of operators $Z_i$ for each qubit $i$. (b) Since $A$ is Clifford, we can commute the $Z_i$ with $A$, yielding a collection of commuting operators $P_i$ (not necessarily diagonal). (c) These operators can be jointly measured by co-diagonalizing them via a Clifford circuit $C_{\mathrm{diag}}$, yielding a collection of diagonal operators $s_i Q_i$ where the $Q_i$ are products of $Z$ operators and $s_i = \pm 1$. Once the $s_i Q_i$ are measured, a simple linear system inversion allows us to emulate the sampling of the initial $Z_i$ operators. Hence, in practice, only $C_{\mathrm{out}}$ and $C_{\mathrm{diag}}$ are effectively performed on the quantum processor.

Remark on co-diagonalization.
To the best of our knowledge, [vdBT20] provides the best approach to produce a co-diagonalization circuit. They show that this task can be reduced to the synthesis of a linear Boolean operator, which has a worst case complexity of $O(n^2 / \log n)$ in the case of a non-constrained architecture. A simple application of any architecture aware CNOT synthesis heuristic, like [dBBV+20], then yields an architecture compliant $C_{\mathrm{diag}}$. This argument is essential to claim that, in most cases, it is more efficient to synthesize $C_{\mathrm{diag}}$ rather than directly synthesizing $A$. Indeed, the synthesis of an arbitrary Clifford operator is usually done by the synthesis of successive layers of Hadamard gates, Phase gates, CNOT gates or CZ gates. Most recent results show that three layers of two-qubit gates are necessary [BM20], making it more affordable to use the co-diagonalization process.

B Recursive search of finite depth
In the three algorithms described in this paper, the extraction functions perform a recursive search over the next $w$ calls to the extraction in order to locally pick the subcircuit that will generate the least extraction overhead. This recursive search is introduced in [HNYN11] for SWAP insertion and can be easily transposed to the linear Boolean operator and Clifford setting. In practice, this is done by computing a tree of depth $w$ containing all the possible choices and associating to each leaf of this tree the sum of all sub-circuit scores on the path from the root to the leaf. In the algorithms presented in this paper, we simply used the CNOT count as a metric, but any other metric, such as overall fidelity or depth, can be used in this search. Figure 10 depicts two search trees of depth 0 and 1.

Figure 10: Search trees of depth 0 and 1. In the first tree, we stop the recursive search at depth 0. In this situation all possible choices a, b, and c are equivalent since they produce sub-circuits of score 6. Hence, we greedily pick the first choice a. After exploring at depth 1, we notice that choosing option bb