Differentiable Quantum Architecture Search
Shi-Xin Zhang,1,2 Chang-Yu Hsieh,2,∗ Shengyu Zhang,2 and Hong Yao1,3,†

1 Institute for Advanced Study, Tsinghua University, Beijing 100084, China
2 Tencent Quantum Laboratory, Tencent, Shenzhen, Guangdong 518057, China
3 State Key Laboratory of Low Dimensional Quantum Physics, Tsinghua University, Beijing 100084, China

(Dated: October 20, 2020)

Quantum architecture search (QAS) is the process of automating the architecture engineering of quantum circuits. It has been desired to construct a powerful and general QAS platform which can significantly accelerate current efforts to identify quantum advantages of error-prone and depth-limited quantum circuits in the NISQ era. Hereby, we propose a general framework of differentiable quantum architecture search (DQAS), which enables automated designs of quantum circuits in an end-to-end differentiable fashion. We present several examples of circuit design problems to demonstrate the power of DQAS. For instance, unitary operations are decomposed into quantum gates, noisy circuits are re-designed to improve accuracy, and circuit layouts for the quantum approximate optimization algorithm are automatically discovered and upgraded for combinatorial optimization problems. These results not only manifest the vast potential of DQAS as an essential tool for NISQ application development, but also present an interesting research topic from the theoretical perspective as it draws inspiration from the newly emerging interdisciplinary paradigms of differentiable programming, probabilistic programming, and quantum programming.
In the noisy intermediate-scale quantum (NISQ) era [1], the hybrid quantum-classical (HQC) computational scheme, combining quantum hardware evaluations with classical optimization outer loops, is widely expected to deliver the first instances of quantum advantage (for certain non-trivial applications) in the absence of fault-tolerant quantum error correction. Prototypical examples in this category include finding the ground state of complex quantum systems with the variational quantum eigensolver (VQE) [2-4], approaching better approximations for NP-hard combinatorial optimization problems with quantum approximate optimization algorithms (QAOA) [5-7], and solving learning tasks in either the classical or quantum context with quantum machine learning (QML) setups [8-12].

In the typical setting of the HQC computational paradigm, the structure of the variational ansatz is held fixed and only the trainable parameters are optimized to satisfy an objective function. This lack of flexibility is rather undesirable, as different families of parametrized circuits may differ substantially in their expressive power and entangling capability [13, 14]. Moreover, in the NISQ era, a thoughtful circuit design should minimize the consumption of quantum resources due to decoherence and the poor connectivity among qubits in current quantum hardware. For instance, the number of two-qubit gates (or the circuit depth) should be minimized to reduce noise-induced errors. Additional error mitigation strategies should be conducted without using extra qubits if possible. With these requirements in mind, the design of an effective circuit ansatz should take into account the nature of the computational problem as well as the specifications of the quantum hardware.
∗ [email protected]
† [email protected]

We term the automated design of parameterized circuits, in the aforementioned setting, quantum ansatz search (QAS). In a broader context, we denote QAS as quantum architecture search, which covers all scenarios of quantum circuit design and goes beyond the design of a variational ansatz for HQC algorithms. QAS can facilitate a broad range of tasks in quantum computation. Its applications include, but are not limited to, decomposing arbitrary unitaries [15] into given quantum gates, finding possible shortcuts for well-established quantum algorithms [16, 17], exploring optimal quantum control protocols [18-20], searching for powerful and resource-efficient variational ansatz [21], and designing end-to-end, task-specific circuits which also incorporate considerations of quantum error mitigation (QEM), the native gate set, and the topological connectivity of a specific quantum hardware [22, 23].

Neural architecture search (NAS) [24], devoted to the study and design of neural networks, shares many similarities with the design of parameterized quantum circuits. Common approaches to NAS include greedy algorithms [25], evolutionary or genetic algorithms [26-30], and reinforcement learning (RL) based methods [31-34]. It is interesting to witness that the progress in QAS follows closely the ideas presented in NAS. Recent works on quantum circuit structure or ansatz design have also exploited greedy methods [35-37], evolutionary or genetic methodologies [16, 17, 21-23, 38] and RL-based approaches [19, 39] for tasks such as quantum control, QEM or circuit ansatz searching [40].

Recently, differentiable neural architecture search (DARTS) has been proposed [41] and further refined with many critical improvements and generalizations [42-47].
The key idea of a differentiable architecture search is to relax the discrete search space of neural architectures onto a continuous and differentiable domain, rendering a much faster end-to-end NAS workflow than previous methods. Due to the close relation between NAS and QAS, it is natural to ask whether it is possible to devise a differentiable quantum architecture search (DQAS) incorporating DARTS-like ideas. Our answer is affirmative; as presented in this work, we construct a general framework of DQAS that works very well as a universal and fully automated design tool for quantum circuits. As a general framework sitting at the intersection of the newly emerging interdisciplinary paradigms of differentiable programming, probabilistic programming and quantum programming, DQAS is of both high theoretical and practical value across various fields in quantum computing and quantum information processing.

Results
Circuit encoding and operation pool.
Any quantum circuit is composed of a sequence of unitaries with or without trainable parameters, i.e.,

U = \prod_{i=0}^{p} U_i(\theta_i),  (1)

where \theta_i can be of zero length when U_i is a fixed unitary gate. Hence, we formulate the framework to cover circuit-design tasks beyond the search for variational ansatz. In the most general terms, these U_i can represent a one-qubit gate, a two-qubit gate or a higher-level block encoding, such as e^{iH\theta} with a pre-defined Hermitian Hamiltonian H. The set of possible unitaries constitutes the operation pool for DQAS, and the algorithm attempts to assemble a quantum circuit by stacking the U_i together in order to optimize a task-dependent objective function.

We denote the choice of primitive unitary gates in the operation pool, along with the circuit layout of these gates, as a circuit encoding. The operation pool contains c different unitaries V_i, and each placeholder U_i should be assigned one of these V's by DQAS. In this work, we refer to the placeholder U_i as the i-th layer of the circuit U, whether or not such a placeholder actually stands for a layer or some other positional label. The circuit design comes with replacement: one V_i from the operation pool can be used multiple times in building a single circuit U.

Objectives.
To enable an end-to-end circuit design, an optimizable objective should be specified. Such objectives are typically just sums of expectation values of some observables in hybrid quantum-classical scenarios, such as combinatorial optimization problems or quantum simulations. Namely, the objective in these cases reads

L = \langle 0 | U^\dagger H U | 0 \rangle,  (2)

where H is a sum of Pauli strings, such as H = -\sum_{\langle ij \rangle} Z_i Z_j for MAX CUT problems, and |0\rangle represents the all-zero direct product state. This loss function L can be easily estimated by performing multiple shots of sampled measurements on quantum hardware.

However, the objectives can assume more general forms for an HQC algorithm. For instance, one may define more sophisticated objectives that depend not only on the mean value of measurements but also on the distribution of different measurements. Examples include CVaR [48] and the Gibbs objective [37], proposed to improve the quality of solutions in QAOA. In general, DQAS-compatible objectives for HQC algorithms assume the following form:

L = \sum_i g_i ( \langle 0 | U^\dagger f_i(H_i) U | 0 \rangle ),  (3)

where f_i and g_i are differentiable functions and the H_i are Hermitian observables.

Extending HQC algorithms to supervised machine learning models, the objective function has to be further generalized to incorporate quantum-encoded data |\psi_j\rangle with corresponding labels y_j:

L = \sum_j \Big( \sum_i g_i ( \langle \psi_j | U^\dagger f_i(H_i) U | \psi_j \rangle ) - y_j \Big).  (4)

Beyond ansatz searching for HQC algorithms, DQAS can be used to design circuits for state preparation or circuit compilation.
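To make the encoding of Eq. (1) and the objective of Eq. (2) concrete, here is a minimal dense state-vector sketch (our own illustration; the helper names are not from the DQAS implementation) that assembles U by stacking operation-pool elements with replacement and evaluates the expectation value for a one-edge MAX CUT Hamiltonian:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

def kron_all(ops):
    """Tensor product of a list of single-qubit operators."""
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

def single(op, wire, n):
    """Embed a one-qubit gate `op` on `wire` of an n-qubit register."""
    return kron_all([op if i == wire else I2 for i in range(n)])

n = 2
# MAX CUT Hamiltonian H = -sum_<ij> Z_i Z_j for the single edge (0, 1)
Hc = -single(Z, 0, n) @ single(Z, 1, n)

# Operation pool: each V_i is stored as a full 2^n x 2^n unitary
pool = [single(H, 0, n), single(H, 1, n), single(X, 0, n)]

# A structure k = [0, 1] assembles U = V_1 V_0 acting on |00>
k = [0, 1]
psi = np.zeros(2**n, dtype=complex)
psi[0] = 1.0
for idx in k:
    psi = pool[idx] @ psi

loss = np.real(psi.conj() @ Hc @ psi)   # Eq. (2): <0|U^dag H U|0>
```

For this particular structure the circuit prepares the uniform superposition, so the expectation value vanishes; choosing other pool indices for k changes the cut value accordingly.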
In these scenarios, the objective is often taken as the fidelity between the proposed circuit design and a reference circuit, and the objective for pure states reads

L = \sum_j \langle \phi_j | U | \psi_j \rangle,  (5)

where |\phi_j\rangle = U_{ref} |\psi_j\rangle is the expected output of the reference circuit when |\psi_j\rangle is the input state. For a state-preparation setup, the objective above reduces to L = \langle \phi | U | 0 \rangle, where |\phi\rangle is the target state. For a general task of unitary learning or compilation, the dimension of |\psi_j\rangle can be as large as 2^n, where n is the number of qubits; this condition may be relaxed by sampling random inputs from the Haar measure [49], which follows the philosophy of machine learning, especially stochastic batched gradient descent.

DQAS setup.
The goal of DQAS is to assign p unitaries (selected from the operation pool) to the placeholders U_i in order to construct a circuit U that minimizes an objective L(U). In addition, DQAS not only determines a potentially optimal circuit layout but also identifies suitable parameters whenever a U_i carries trainable parameters \theta_i.

To facilitate the architecture search, it is tempting to relax the combinatorial problem onto a continuous domain, amenable to optimization via gradient descent. We propose to embed the discrete structural choices into a continuously parameterized probabilistic model. For instance, we consider a probabilistic model P(k, \alpha), where k is the discrete structural parameter determining the quantum circuit structure. For example, if k = [1, 0, 2], then U(k) = V_1 V_0 V_2, where V_1, V_0 and V_2 refer to elements of the predefined operation pool introduced earlier. In the context of Eq. (1), U_0 = V_1, U_1 = V_0 and U_2 = V_2.

In summary, discrete random variables k are sampled from a probabilistic model characterized by parameters \alpha. A particular k determines the structure of the circuit U(k), and this circuit is used to evaluate the objective L(U). The final objective L depends indirectly on both the variational parameters \theta and the probabilistic model parameters \alpha, which can both be trained via gradient descent using automatic differentiation.

Since DQAS needs to sample multiple circuits U before deciding whether the current probabilistic model is ideal, we adopt a circuit-parameter reusing mechanism for the parametrized V_i in the operation pool. In other words, we store a matrix of parameters of size p × c × l, where p is the total number of unitaries (layers) used to build the circuit, c is the size of the operation pool and l is the number of parameters of each unitary in the operation pool; we denote this as the circuit parameter pool.
Therefore, every sampled parametrized V_i is initialized with l parameters taken from the circuit parameter pool, depending on the placeholder index and its operation-pool index i.

The final end-to-end objective for DQAS reads

L = \sum_{k \sim P(k, \alpha)} L(U(k, \theta)).  (6)

We recapitulate DQAS as Algorithm 1, with a visualized workflow in Fig. 1.

Algorithm 1
Differentiable Quantum Architecture Search.
Require: p, the number of components to build the circuit; an operation pool with c distinct unitaries; a probabilistic model with parameters \alpha of shape p × c, initialized to all zeros; a reusable parameter pool \theta of shape p × c × l initialized with a given initializer, where l is the maximum number of parameters of any op in the operation pool.

while not converged do
  Sample a batch of K circuits from the model P(k, \alpha).
  Compute the objective for each circuit in the batch in the form of Eq. (2), Eq. (4) or Eq. (5), depending on the problem setting.
  Compute the gradients with respect to \theta and \alpha according to Eq. (7) and Eq. (8), respectively.
  Update \theta and \alpha using the given gradient-based optimizers and learning schedules.
end while

Get the circuit architecture k∗ with the highest probability in P(k, \alpha), and fine-tune the circuit parameters \theta∗ associated with this circuit if necessary.
return the final optimal circuit structure labeled by k∗ and the associated weights \theta∗.

Gradients.
DQAS needs to optimize two sets of parameters, \alpha and \theta, in order to identify a potentially ideal
FIG. 1. Schematic illustration of DQAS. We sample a batch of circuit configurations at each epoch from the probabilistic model P(k, \alpha). We then compose the corresponding quantum circuits by filling in operations and parameters from the two pools. We evaluate the quantum circuits and compute the final objective L, and \alpha and \theta are adjusted accordingly with gradient-based optimization methods.

circuit for the task at hand. The gradients with respect to the trainable circuit parameters \theta are easy to determine:

\nabla_\theta L = \sum_{k \sim P(k, \alpha)} \nabla_\theta L(U(k, \theta)).  (7)

\nabla_\theta L(U) can be obtained with automatic differentiation in a classical simulation, and from the parameter-shift rule [50] or other analytical gradient measurements [51] in quantum experiments.

As explained in Algorithm 1, not all \theta parameters are present in the circuits sampled according to the probability P(k, \alpha) at a given iteration. For parameters missing from a particular circuit, the gradients are simply set to 0, as anticipated.

Calculation of the gradients for \alpha should be treated more carefully, since these parameters are directly related to the outcomes of the Monte Carlo sampling from P(k, \alpha). The calculation of gradients of Monte Carlo expectations is an extensively studied problem [52] with two mainstream solutions: the score function estimator [53] (also denoted REINFORCE [54]) and the pathwise estimator (also denoted the reparametrization trick [55]). In this work, we utilize the score function approach as it is more general and bears the potential to support calculations of higher-order derivatives if desired [56, 57]. For an unnormalized probabilistic model, the gradient with respect to \alpha is given by

\nabla_\alpha L = \sum_{k \sim P} \nabla_\alpha \ln P(k, \alpha) \, L(U(k, \theta)) - \sum_{k \sim P} L(U(k, \theta)) \sum_{k \sim P} \nabla_\alpha \ln P(k, \alpha).  (8)

For normalized probability distributions, \langle \nabla_\alpha \ln P \rangle_P = 0 and we may simply focus on the first term.
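As a hedged, minimal sketch (our own toy code, not the reference implementation), the following puts Algorithm 1 and the gradients of Eqs. (7) and (8) together on a one-qubit problem: a pool of {identity, Rx(\theta)} over p = 2 placeholders, trained so that the sampled circuit maps |0\rangle to |1\rangle. The structural gradient uses the baseline-subtracted score function with the softmax \nabla_\alpha \ln P, and a finite-difference gradient stands in for parameter-shift on \theta:

```python
import numpy as np

rng = np.random.default_rng(0)

def rx(theta):
    """Single-qubit rotation exp(-i * theta * X / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

ops = [lambda th: np.eye(2, dtype=complex),   # V_0: identity (parameter unused)
       lambda th: rx(th)]                     # V_1: Rx(theta)
p, c = 2, 2                                   # placeholders, pool size
alpha = np.zeros((p, c))                      # structural parameters
theta = rng.normal(0.5, 0.1, size=(p, c))     # shared circuit-parameter pool

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def loss(k, th):
    """Infidelity of U(k)|0> with respect to the target |1>."""
    psi = np.array([1.0 + 0j, 0.0])
    for i, j in enumerate(k):
        psi = ops[j](th[i, j]) @ psi
    return 1.0 - abs(psi[1]) ** 2

K, eps = 64, 1e-4
for epoch in range(200):
    prob = softmax(alpha)
    ks = [tuple(rng.choice(c, p=prob[i]) for i in range(p)) for _ in range(K)]
    ls = np.array([loss(k, theta) for k in ks])
    base = ls.mean()                          # baseline, cf. second term of Eq. (8)
    g_alpha, g_theta = np.zeros_like(alpha), np.zeros_like(theta)
    for k, l in zip(ks, ls):
        for i, j in enumerate(k):
            g_alpha[i] -= (l - base) * prob[i]   # score function: grad ln P is
            g_alpha[i, j] += (l - base)          # onehot(j) - prob, cf. Eq. (11)
            tp = theta.copy(); tp[i, j] += eps   # finite-difference theta gradient
            g_theta[i, j] += (loss(k, tp) - l) / eps
    alpha -= 0.2 * g_alpha / K
    theta -= 0.5 * g_theta / K

best = tuple(int(j) for j in np.argmax(softmax(alpha), axis=1))
```

On this toy landscape the search reliably settles on a structure using at least one Rx placeholder, with the corresponding pooled angles drifting toward a total rotation of \pi.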
The gradient of \ln P can be easily evaluated via backward propagation through the given, well-defined probabilistic model.

Probabilistic models.
Throughout this work, we utilize the simplest probabilistic model: the independent categorical probabilistic model. We stress that more complicated models, such as energy-based models [9, 58, 59] and autoregressive models [60-63], may yield better performance in settings where explicit correlation between circuit layers is important. Such sophisticated probabilistic models can be easily incorporated into DQAS, and we leave this investigation for future work.

The independent categorical probabilistic model we utilize is

P(k, \alpha) = \prod_{i=1}^{p} p(k_i, \alpha_i),  (9)

where the probability p at each layer is given by a softmax,

p(k_i = j, \alpha_i) = \frac{e^{\alpha_{ij}}}{\sum_{k} e^{\alpha_{ik}}},  (10)

where k_i = j means that we pick U_i = V_j from the operation pool, and the parameters \alpha are of dimension p × c.

The gradient of such a probabilistic model can be determined analytically:

\nabla_{\alpha_{ij}} \ln p(k_i = m) = \delta_{jm} - p(k_i = j).  (11)

Applications.
DQAS is a versatile tool for near-term quantum computation. In the following, we present several concrete examples to illustrate DQAS's potential to accelerate the research and development of quantum algorithms and circuit compilation in the NISQ era [64]. Our implementation is based on the quantum simulation backends of either the Cirq [65]/TensorFlow Quantum [66] stack or the TensorNetwork [67]/TensorCircuit [68] stack.

Firstly, it is natural to apply DQAS to quantum circuit design for state preparation as well as unitary decomposition. For example, we can use DQAS to construct an exact quantum circuit for GHZ state preparation or a Bell circuit unitary decomposition [40]. Below we focus on the QEM and HQC contexts in detail to demonstrate the power of DQAS for NISQ-relevant tasks.
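As a sanity check of the state-preparation objective L = \langle \phi | U | 0 \rangle mentioned above, the textbook GHZ circuit (a Hadamard followed by two CNOTs; our own minimal simulator, not the paper's code) reaches unit fidelity with (|000\rangle + |111\rangle)/\sqrt{2}:

```python
import numpy as np

H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def apply_1q(psi, gate, wire, n):
    """Apply a single-qubit gate on `wire` of an n-qubit state vector."""
    t = psi.reshape([2] * n)
    t = np.moveaxis(np.tensordot(gate, t, axes=([1], [wire])), 0, wire)
    return t.reshape(-1)

def apply_cnot(psi, ctrl, targ, n):
    """Apply CNOT by permuting basis amplitudes (big-endian qubit order)."""
    out = psi.copy()
    for idx in range(len(psi)):
        if (idx >> (n - 1 - ctrl)) & 1 and not (idx >> (n - 1 - targ)) & 1:
            flip = idx ^ (1 << (n - 1 - targ))
            out[idx], out[flip] = out[flip], out[idx]
    return out

n = 3
psi = np.zeros(2**n); psi[0] = 1.0
psi = apply_1q(psi, H1, 0, n)
psi = apply_cnot(psi, 0, 1, n)
psi = apply_cnot(psi, 1, 2, n)

ghz = np.zeros(2**n); ghz[0] = ghz[-1] = 1 / np.sqrt(2)
fidelity = abs(ghz @ psi)      # the objective L = <phi|U|0...0>
```

A DQAS search over a pool of such elementary gates would maximize exactly this fidelity; here the known circuit attains the optimum by construction.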
Quantum error mitigation on QFT circuit.
Next, we demonstrate that DQAS can also be applied to design noise-resilient circuits that mitigate quantum errors during a computation. The strategy we adopt in this work is to insert single-qubit gates (usually Pauli gates) into the empty slots of a quantum circuit where the given qubit would otherwise be idle/waiting. Such a gate-inserting technique can mitigate quantum errors since the extra unitaries (collectively acting as an identity operation) can turn coherent errors into stochastic Pauli errors, which are easier to handle and effectively reduce the final infidelity. Similar QEM tricks are reported in related studies [69, 70].

The testbed is the standard circuit for the quantum Fourier transformation (QFT), as shown in Fig. 2(a). We assume the following error model for the underlying quantum hardware. In between two quantum gates, there is a 2% chance of a bit-flip error on a qubit. When a qubit is in an idle state (with a much longer waiting time), there is a much higher chance, about 20%, of a bit-flip error. Although this error model is ad hoc, it does not prevent us from demonstrating how DQAS can automatically design noise-resilient circuits.

Looking at Fig. 2(a), there are six empty slots in the standard QFT-3 circuit. Hence, we specify these slots as p = 6 placeholders for a search of noise-resilient circuits with DQAS. The search ends when DQAS fills each placeholder with a discrete single-qubit gate such that the fidelity of the circuit's output (with respect to the expected outcome) is maximized in the presence of noise. If the operation pool is limited to the Pauli gates and the identity, {I, X, Y, Z}, then DQAS recommends a rather trivial circuit for error mitigation. In short, DQAS fills the paired gaps (of qubit 0 and qubit 2) with the same Pauli gate twice, which together yield an identity, in order to reduce the error in the gap. As for qubit 1, where a single gap occurs at the beginning and at the end of the circuit as shown in Fig.
2(a), DQAS simply fills these gaps with nothing (the identity placeholder). However, if we allow more variety of gates in the operation pool, such as the S gate and the T gate, then more interesting circuits can be found by DQAS. For instance, Fig. 2(b) is one such example. In this case, DQAS fills the two gaps of qubit 1 with a T gate each. This circuit cannot be found by the simpler strategy of inserting unitaries into consecutive gaps. Thus, DQAS provides a systematic and straightforward approach to identify this kind of long-range correlated gate assignment, which effectively reduces the detrimental effects of noise.

We also carried out DQAS on the QFT circuit for 4 qubits with p = 12 circuit gaps, as shown in Fig. 3(a). DQAS automatically finds a better QEM architecture which again outperforms naïve gate-inserting policies. Fig. 3(b) displays one such example. The interesting pattern of long-range correlated gate insertions is again evident for qubit 2. It is also clear that DQAS learns that more than two consecutive gates can combine collectively to render an identity, such as the three inserted gates on qubit 0. Further details on the search for optimal QEM architectures and a comprehensive comparison of the experimental values of the final fidelities can be found in the Supplemental Materials [40].

FIG. 2. (a) The basic circuit for QFT on 3 qubits; the T gate and H gate in the middle of the circuit can be arranged in the same vertical moment with no gap, leaving six gaps (two on each qubit) in this setup. (b) The QEM circuit for QFT automatically found by DQAS. All slots are filled; DQAS is powerful enough to learn long-range correlations so that it can fill the gaps on qubit 1, which are separately located. The fidelities between these two circuits run on noisy hardware and the ideal circuit are 0.33 and 0.6, respectively.

In summary, DQAS not only learns about inserting pairs of gates as an identity into the circuit to mitigate quantum error, but also picks up the technique of long-range correlated gate assignment to further reduce noise effects. This result is encouraging and shows how instrumental a tool DQAS may be for designing noise-resilient circuits with moderate consumption of computational resources. In this study, we only adopt the simple gate-insertion policy for QEM within the DQAS framework. We expect more sophisticated QEM methods may also be adapted to work with DQAS to identify novel types of noise-resilient quantum circuits. This is a direction that we will actively explore in follow-up studies.
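The algebra behind the inserted-gate patterns above is elementary and easy to verify: Pauli gates are self-inverse, so a pair filling two idle slots acts as the identity on the ideal circuit, while the enlarged pool obeys T² = S and S² = Z, which lets longer runs of inserted gates collapse to an identity as well. A small check with the standard gate definitions:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)
S = np.diag([1.0, 1j])
T = np.diag([1.0, np.exp(1j * np.pi / 4)])

# Any Pauli pair inserted into two idle slots composes to the identity:
pauli_pairs_are_identity = all(np.allclose(P @ P, I2) for P in (X, Y, Z))
# Phase gates compose down the T -> S -> Z ladder:
t_squared_is_s = np.allclose(T @ T, S)
s_squared_is_z = np.allclose(S @ S, Z)
```

These identities are what make the inserted unitaries transparent to the noiseless circuit while breaking up the long idle periods on which the error model acts most strongly.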
QAOA ansatz searching.
QAOA introduces the adiabatic-process-inspired ansatz that stacks alternating Hamiltonian evolution blocks e^{-i\theta H}, where H can be different Hermitian Hamiltonians. To employ DQAS to design parametrized quantum circuits within the hybrid quantum-classical paradigm for algorithmic development, we may adopt a higher-level circuit encoding scheme inspired by QAOA. More specifically, the operation pool consists of e^{-i\theta H} blocks with different Hermitian Hamiltonians, together with a parameter-free layer of transversal Hadamard gates ⊗^n H. In comparison
FIG. 3. (a) The basic circuit for QFT on 4 qubits; some of the gates can be arranged in the same vertical moment with no gap, leaving 12 gaps in this arrangement. (b) The QEM circuit for QFT automatically found by DQAS. The fidelities between these two circuits run on noisy hardware and the ideal circuit are 0.13 and 0.46, respectively.

to assembling a circuit by specifying individual quantum gates, this circuit encoding scheme allows a compact and efficient description of large-scale and deep circuits. For simplicity, we dub the circuit-encoding scheme above the layer encoding.

For illustration, in this subsection we apply DQAS to design parametrized circuits for the MAXCUT problem in a QAOA-like fashion. Aiming to let DQAS find an ansatz without imposing strong QAOA-type assumptions on the circuit architecture, we expand the operation pool with additional Hamiltonians of the form \hat{H} = -\sum_{\langle ij \rangle} O_i O_j and \hat{H} = \sum_i O_i, where O ∈ {X, Y, Z}; we refer to these operations as the xx-layer, rx-layer, rz-layer and so on. In addition, we also add the transversal Hadamard gates and denote them as the H-layer. All these primitive operations can be compiled into digital quantum gates exactly.

Next, let us elaborate on an interesting account in which DQAS automatically re-discovers the standard QAOA circuit for the MAXCUT problem. To begin, we distinguish two settings: instance learning (for a single MAXCUT problem) and ensemble learning (for MAXCUT problems on an ensemble of graphs). As noted in [71], the expected outputs of an ensemble of QAOA circuits (defined by graph instances from, say, Erdős-Rényi distributions or regular graph distributions) with fixed variational parameters \theta are highly concentrated. The implication of such concentration is that the optimal parameters for an arbitrary instance in the ensemble can be quite close to being optimal for the entire ensemble of graph instances. This fact not only increases the stability of the learning process with an ensemble of data inputs, but also makes QAOA more practical when the outer optimization loop can be done in this once-for-all fashion. In this work, we apply DQAS on both the instance learning task and the regular graph ensemble learning task [40].

For ensemble learning on a regular graph ensemble (8 nodes, degree 3), we let DQAS search for an optimal circuit design with p = 5. Using the aforementioned operation pool comprising the H-layer, rx/y/z-layers and zz-layer, with the expected energy of the MAXCUT Hamiltonian as the objective function, DQAS recommends the optimal circuit with the following layout: H, zz, rx, zz, rx layers, which coincides exactly with the original QAOA circuit. For metrics from the searching stage, see Fig. 4.

FIG. 4. Metrics of DQAS training (depth p = 5) for the MAXCUT problem on the degree-3 regular graph ensemble with 8 nodes. The upper plot shows the expected energy/averaged cut value during training; the loss approaches that of the P = 2 QAOA with the H, zz, rx, zz, rx layer arrangement. The lower plot indicates how the probability of this optimal layout increases as the underlying probabilistic model is updated.

We also carried out DQAS for QAOA ansatz searching with multiple-objective considerations of hardware details, as well as a double-layer block encoding for the operations. For details, see the Supplemental Materials [40].

Reduced graph ansatz searching.
Toward designing shallower circuits than QAOA, another approach worth attempting is to re-define the primitive circuit layers in the operation pool. For instance, the zz-layer block is usually generated by the Ising Hamiltonian with the full connectivity of the MAXCUT problem. However, if the underlying graph of a zz-layer is only a subgraph, then the number of gates is reduced. Suppose we now replace the standard zz-layer (with the full connectivity of the original problem) with a set of reduced zz-layers (each generated by a subgraph containing at most half of all edges in the original graph); then a circuit comprising 2 such reduced zz-layers is shallower than the standard
FIG. 5. Schematic workflow for reduced graph ansatz search in the MAX CUT setup. In reduced ansatz searching, there are various reduced-graph-backed zz-layers in the operation pool. These reduced graphs are subgraph instances of the problem graph instance. DQAS not only finds the optimal layout and optimal parameters \theta, but also finds the best reduced graphs for these zz-layers.

P = 1 QAOA circuit. As summarized below, an ansatz built from such reduced zz-layers is more resource efficient and outperforms the vanilla QAOA layout using the same number of quantum gates. Fig. 5 summarizes the DQAS workflow for searching an ansatz with reduced zz-layers.

To demonstrate the effectiveness of this strategy, we consider circuit design in the instance learning setup, in which the reduced zz-layers in the operation pool are induced by the graph connectivity of a particular instance. In this numerical experiment, we again set out to design a p = 5 circuit for n = 8 qubits. More specifically, we generate 10 subgraphs with edge density lower than half that of the base graph and substitute the base zz-layer with these 10 newly introduced reduced zz-layers in the operation pool. In such a setup, DQAS is responsible for finding (1) an optimal circuit layout of the different types of layers, (2) the best reduced graphs that give rise to the zz-layers in the circuit, and (3) optimal parameters \theta for the rO-layers and zz-layers.

Here we give a concrete example. For an arbitrary graph instance drawn from the Erdős-Rényi distribution with a MAX CUT of 12, DQAS automatically designs a circuit that exactly predicts the MAX CUT of 12. This p = 5 circuit is composed of the following layers: rx-layer, zz-layer, zz-layer, ry-layer and rx-layer. Note that the two zz-layers are induced by distinct underlying subgraphs with only four edges each. As a comparison, the P = 1 vanilla QAOA gives an expected MAX CUT of 10.39, while the P = 2 vanilla QAOA predicts 11.18. The reduced ansatz designed by DQAS consumes about the same amount of quantum resources as the P = 1 QAOA circuit yet outperforms even the vanilla P = 2 QAOA circuit. We stress that such an encouraging result is not a special case. By using the reduced ansatz layers, we can consistently find reduced ansatz that outperform vanilla QAOA of the same depth for MAX CUT problems on a variety of unweighted and weighted graphs [40].

DQAS not only can learn QAOA from scratch, but can also easily find better alternatives with shorter circuit depth from an operation pool using slightly tweaked Hamiltonian evolution blocks as primitive circuit layers. This last achievement is of paramount importance in the NISQ era, where circuit depth is a key limitation.

Discussion
DQAS is a versatile and useful tool in the NISQ era. Not only can DQAS handle the design of a quantum circuit, but it can also be seamlessly tailored to a specific quantum hardware with a customized noise model and native gate set in order to obtain the best results for error mitigation. We have demonstrated the potential of DQAS with the following examples: circuit design for state preparation and unitary decomposition (compilation), and noiseless and noisy circuit design for hybrid quantum-classical computations. In particular, we also introduce the reduced ansatz design, which proposes shallower circuits that outperform the conventional, inherently more resource-intensive QAOA circuits.

In conclusion, we re-formulate the design of quantum circuits and hybrid quantum-classical algorithms as automated differentiable quantum architecture search. Inspired by the DARTS-like setup in NAS, DQAS works in a differentiable search space for quantum circuits. By tweaking multiple ingredients of DQAS, the framework is highly flexible and versatile. Not only can it be used to design optimal quantum circuits under different scenarios, but it also does the job in a highly customized fashion that takes into account native gate sets, hardware connectivity, and the error models of specific quantum hardware. The theoretical framework itself offers fertile ground for further study as it draws advanced concepts and techniques from the newly emerged interface of the differentiable, probabilistic, and quantum programming paradigms.
Methods
Connections to DARTS.
We illustrate how DQAS is related to DARTS in neural architecture search. In particular, we draw attention to a specific variant of DARTS, probabilistic neural architecture search [47], which employs a probabilistic model as the NAS backend. Both frameworks represent the super network (during an architecture search) in terms of a probabilistic model, and rely on Monte Carlo sampling along with the score-function method to evaluate gradients for the structural variables.

Different from DARTS, the probabilistic description of the super network is not just an optional approach for avoiding memory-intensive operations to deterministically evaluate the super network, but rather an indispensable ingredient of DQAS for circuit design in the NISQ era. More precisely, the super network analogy of a quantum circuit dictates an implementation of a complex (and potentially non-unitary) operation comprising the elementary unitaries U_j present in the operation pool,

U = \prod_{i=0}^{p} \Big( \sum_j \alpha_{ij} U_j \Big).  (12)

This complex operation might be implemented in a quantum circuit via methods like the linear combination of unitaries [72],
which is expensive in the NISQ era. The alternative based on the probabilistic model tremendously reduces the near-term implementation challenges by sampling a batch of simpler quantum circuits, each of which only carries out an easily implementable unitary transformation. Apart from this implementation issue in quantum circuits motivating a probabilistic solution, DQAS and probabilistic DARTS are highly similar. The comparison is summarized succinctly in Fig. 6.

FIG. 6. Comparison between the setups of (a) DARTS and (b) DQAS in this work. In DARTS, the super network with all paths is evaluated at the same time, while in DQAS, in accordance with the quantum circuit setting, only one path is evaluated at a time as a quantum circuit simulation (indicated by the solid line in (b)); different path choices are determined by the underlying probabilistic model.

There are various training ingredients that can be incorporated into the DQAS framework, and many of these tricks and improvements are inspired by works devoted to developing DARTS. In this subsection, we elaborate on some of these advanced techniques that we have tested.
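The sampling-based structural update described above can be illustrated with a minimal, self-contained sketch. All names below are hypothetical, and the toy `objective` is a stand-in for an actual quantum circuit evaluation: structures k are drawn from a product of softmax-categorical distributions parameterized by α, and α is updated with the score-function estimator using a running-average baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n_ops = 3, 4               # circuit layers, size of the operation pool
alpha = np.zeros((p, n_ops))  # structural parameters, zero-initialized

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def objective(k):
    """Toy stand-in for a quantum-circuit evaluation of structure k.
    The (hypothetical) optimum here is op 0 in every layer; lower is better."""
    return float(np.sum(k))

baseline = 0.0
for _ in range(200):
    probs = softmax(alpha)        # independent categorical model per layer
    grads, losses = [], []
    for _ in range(16):           # Monte Carlo batch of sampled structures
        k = np.array([rng.choice(n_ops, p=probs[i]) for i in range(p)])
        losses.append(objective(k))
        # score function: grad of log P(k) w.r.t. alpha for a softmax model
        g = -probs.copy()
        g[np.arange(p), k] += 1.0
        grads.append(g)
    losses = np.array(losses)
    # control variate: subtract the running-average baseline to cut variance
    grad = np.mean([(l - baseline) * g for l, g in zip(losses, grads)], axis=0)
    baseline = losses.mean()
    alpha -= 0.1 * grad           # gradient descent on structural parameters

best = softmax(alpha).argmax(axis=1)
print(best)  # probabilities concentrate on op 0 in every layer
```

In a real DQAS run the inner loop would additionally update the trainable circuit parameters θ, and `objective` would be a (noisy) circuit measurement rather than a closed-form function.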
Multiple starts.
Since the energy landscape of the objective function may be very rugged, parallel training on multiple instances with different fractions of the dataset, initializations, or randomization schemes may be necessary; the candidate circuit with the best objective value amongst all training instances is returned as the final result.
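The multiple-starts strategy can be sketched as follows; the 1-D rugged landscape and the helper `train_one_instance` are hypothetical stand-ins for a full DQAS training run.

```python
import numpy as np

def train_one_instance(seed, steps=200):
    """Toy stand-in for one DQAS training instance: gradient descent on a
    rugged 1-D landscape f(x) = sin(5x) + 0.1 x^2 from a random start."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-4.0, 4.0)
    for _ in range(steps):
        grad = 5.0 * np.cos(5.0 * x) + 0.2 * x  # f'(x)
        x -= 0.02 * grad
    return np.sin(5.0 * x) + 0.1 * x**2, x

# parallel training instances with different random initializations;
# the candidate with the lowest objective is returned as the final result
results = [train_one_instance(seed) for seed in range(8)]
best_obj, best_x = min(results)
print(best_obj, best_x)
```

Each restart typically converges to the local minimum of its own basin; keeping only the best of several seeds makes it likely that at least one run lands near the global minimum.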
Parameters prethermalization.
We pretrain and update the circuit trainable parameters θ from the parameter pool for several epochs as a prethermalization process. A related topic is parameter initialization. Since we would like to search quantum structures without bias, an all-zero initializer is used for the structural probabilistic model parameters α. For the trainable parameters θ, we can sometimes apply domain-specific knowledge to the initialization; see the QAOA applications in the main text for an example.

Early stopping.
To avoid overfitting and reduce runtime, some form of early stopping may be adopted during the training of DQAS. Following common practices in training DARTS, we may consider typical criteria, such as the standard deviation in a batch of objective evaluations L or the standard deviation in the probability of each layer P, to decide when to invoke early stopping. The performance gains of DARTS due to early stopping were documented in [43, 44].

Top-k grid search or beam search.
If the energy landscape is sufficiently complex, DQAS may settle for a locally optimal solution instead of a global one. In such cases, early stopping can be combined with the so-called top-k grid search to avoid being trapped in a local minimum. Namely, for each layer of the ansatz, we keep the top k (usually k = 2) most probable operations instead of only the most probable one. Therefore, we have k^p candidates for the optimal circuit ansatz. One can easily train these candidate circuits, benchmark their performance, and pick the top-performing one as the optimal circuit architecture for a given problem. Similarly, we can utilize beam search for the optimal structure search, a common strategy for text decoders in the NLP community. Beam search always maintains the k most probable circuit structures and thus avoids the exponential scaling of grid search.

Baseline for score function estimators.
It is well known that score-function Monte Carlo gradient estimators suffer from high variance, although the situation can be alleviated by the baseline or control variate approach. Namely, since for any normalized probability distribution P we have E_{k \sim P}[\nabla \ln P(k)] = 0, one can subtract any constant from the objective in the gradient estimator as a baseline in order to reduce the variance. For instance, during the training of DQAS, we use the running average of L as the baseline: the loss in Eq. (8) is actually replaced by L - \bar{L}, where \bar{L} is the average of the objective over the last evaluated batch.

Layer-by-layer learning.
For deep quantum circuits with large p, DQAS may be hard to train from scratch. Inspired by progressive training [46, 73] for DARTS and for quantum neural networks, we apply similar ideas to DQAS. Namely, one first finds an optimal quantum structure with small p and then adaptively increases p by adding more layers to be trained. In this process, one can also reduce the number of candidate operations in the pool based on the knowledge gained from training instances with smaller p.

Random noise on parameters θ.

The high-dimensional energy landscape can be very rugged in theory. However, based on some numerical evidence, the optimal quantum circuit tends to consistently output similar objective values even when the trainable parameters θ deviate slightly from their optimal values. This observation suggests that, in terms of the trainable parameters, the landscape around the optimal quantum structure is flatter than expected. Therefore, to facilitate the search for an optimal circuit architecture, one may add random noise to the trainable parameters θ to escape local traps. It is worth noting that in our setup the random noise is only added to the trainable/network parameters, rather than to the structural parameters as in [74], which tried to bridge the performance gap between the two stages of DARTS.

Regularization and penalty terms.
Similar to conventional practices in training neural networks, regularization and penalty terms may be introduced in DQAS to avoid overfitting or to induce sparsity. The applicability of regularization on the tunable parameters θ (with respect to a fixed circuit design) is easily understood given the high similarity between neural networks and parameterized quantum circuits. In this section, instead, we focus on imposing regularizations on the structural parameters α and demonstrate the benefits of regularization in searching for a resource-efficient architecture. We provide two concrete examples for illustration. The first example deals with the issue of block merging in DQAS. In the simplest probabilistic model for P(\vec{k}, \vec{\alpha}), each circuit layer is independently sampled, so there is a high probability that the same parametrized gates are picked consecutively. Namely, the final architecture may contain snippets like rx(θ) rx(θ), which can easily be merged into one layer. To address this issue, we propose to add the following term to the final objective L to punish such trivial arrangements of circuit layers,

\Delta L = \lambda \sum_{i=1}^{p} \sum_{k \in c} p(k_i = k, \alpha) \, p(k_{i-1} = k, \alpha). (13)

Secondly, since two-qubit gates are primarily responsible for the infidelity and errors of quantum computations, it is desirable to select a circuit architecture comprising a smaller number of two-qubit gates. Such resource considerations can be encouraged during the architecture search by explicitly adding penalty terms of the following form,

\Delta L = \lambda \sum_{i=1}^{p} \sum_{k \in c} p(k_i = k, \alpha), (14)

where the sum runs over the relevant subset c of operations in the pool, e.g., the two-qubit gates. We note that similar regularization for achieving better training performance [44] and multi-objective considerations specifically targeting computational complexity [75] have also been reported in the NAS literature.

Proxy tasks and transfer learning.
DARTS heavily relies on the idea of proxy tasks to boost performance. In DARTS training, one first trains and identifies a suitable network architecture on the simpler CIFAR-10 (image) dataset; subsequently, one uses the same block topology to build neural network classifiers for the large-scale ImageNet dataset. The same technique may be adapted to DQAS: find structures or patterns in quantum circuits for small-size problems with a small number of qubits or layers, and try to apply similar patterns to larger problems. For quantum circuit design, we can even classically simulate the training for small proxy tasks and transfer the optimal structures to large problems beyond classical computation power.

We recommend various training techniques, inspired by DARTS-related studies, to obtain a more robust and versatile DQAS. For more details on the training techniques and hyperparameters utilized in our numerical experiments, see the Supplemental Materials [40]. Due to the close relation between architecture search in the contexts of quantum circuits and neural networks, more interesting ideas may be borrowed from NAS to further improve QAS, and innovations in QAS may in turn inspire developments in NAS.
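The structural penalties of Eqs. (13) and (14) are cheap to evaluate classically, since they only involve the layer-wise probabilities of the structural model. A minimal sketch follows; the pool layout and the sets of parameterized and two-qubit operations are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical pool: ops 0-1 are parameterized single-qubit rotations,
# ops 2-3 are two-qubit gates
p, n_ops = 4, 4
rng = np.random.default_rng(1)
alpha = rng.normal(size=(p, n_ops))
probs = softmax(alpha)    # p(k_i = k, alpha), layers sampled independently

param_ops = [0, 1]        # set c in Eq. (13): parameterized gates
twoq_ops = [2, 3]         # set c in Eq. (14): two-qubit gates
lam = 0.1

# Eq. (13): punish identical parameterized gates in consecutive layers
merge_penalty = lam * sum(
    probs[i, k] * probs[i - 1, k] for i in range(1, p) for k in param_ops
)

# Eq. (14): penalize the expected number of two-qubit gates in the circuit
twoq_penalty = lam * probs[:, twoq_ops].sum()

print(merge_penalty, twoq_penalty)
```

Both terms are differentiable in α, so they can simply be added to the DQAS objective before the structural gradient step.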
Acknowledgments
This work is supported in part by the NSFC under GrantNo. 11825404 (SXZ and HY), the MOSTC under GrantNos. 2016YFA0301001 and 2018YFA0305604 (HY), theStrategic Priority Research Program of Chinese Academyof Sciences under Grant No. XDB28000000 (HY), Bei-jing Municipal Science and Technology Commission un- der Grant No. Z181100004218001 (HY), and Beijing Nat-ural Science Foundation under Grant No. Z180010 (HY). [1] J. Preskill, Quantum Computing in the NISQ era andbeyond, Quantum , 79 (2018).[2] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q.Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien,A variational eigenvalue solver on a photonic quantumprocessor, Nat. Commun. , 4213 (2014).[3] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, The theory of variational hybrid quantum-classical algorithms, New J. Phys. , 023023 (2016).[4] S. McArdle, S. Endo, A. Aspuru-Guzik, S. C. Benjamin,and X. Yuan, Quantum computational chemistry, Rev.Mod. Phys. , 015003 (2020).[5] E. Farhi, J. Goldstone, and S. Gutmann, A QuantumApproximate Optimization Algorithm, arXiv:1411.4028(2014).[6] S. Hadfield, Z. Wang, B. O’Gorman, E. Rieffel, D. Ven-turelli, and R. Biswas, From the Quantum ApproximateOptimization Algorithm to a Quantum Alternating Op-erator Ansatz, Algorithms , 34 (2019).[7] L. Zhou, S.-T. Wang, S. Choi, H. Pichler, and M. D.Lukin, Quantum Approximate Optimization Algorithm:Performance, Mechanism, and Implementation on Near-Term Devices, Phys. Rev. X , 021067 (2020).[8] E. Farhi and H. Neven, Classification with Quantum Neu-ral Networks on Near Term Processors, arXiv:1802.06002(2018).[9] G. Verdon, J. Marks, S. Nanda, S. Leichenauer,and J. Hidary, Quantum Hamiltonian-Based Modelsand the Variational Quantum Thermalizer Algorithm,arXiv:1910.02071 (2019).[10] M. Benedetti, E. Lloyd, S. Sack, and M. Fiorentini, Pa-rameterized quantum circuits as machine learning mod-els, Quantum Sci. Technol. , 043001 (2019).[11] M. Benedetti, D. Garcia-pintos, O. Perdomo, V. 
Leyton-Ortega, Y. Nam, and A. Perdomo-Ortiz, A generativemodeling approach for benchmarking and training shal-low quantum circuits, npj Quantum Inf. , 45 (2019).[12] I. Cong, S. Choi, and M. D. Lukin, Quantum convolu-tional neural networks, Nat. Phys. , 1273 (2019).[13] V. Akshay, H. Philathong, M. E. S. Morales, and J. D.Biamonte, Reachability Deficits in Quantum Approxi-mate Optimization, Phys. Rev. Lett. , 090504 (2020).[14] E. Farhi, D. Gamarnik, and S. Gutmann, The QuantumApproximate Optimization Algorithm Needs to See theWhole Graph: A Typical Case, arXiv:2004.09002 (2020).[15] B. T. Kiani, S. Lloyd, and R. Maity, Learning Unitariesby Gradient Descent, arXiv:2001.11897 (2020).[16] R. Li, U. Alvarez-Rodriguez, L. Lamata, and E. Solano,Approximate Quantum Adders with Genetic Algorithms:An IBM Quantum Experience, Quantum Meas. Quan-tum Metrol. , 1 (2017).[17] O. M. Sotnikov, L. Cincio, Y. Subası, A. T. Sornborger,and P. J. Coles, Learning the quantum algorithm for stateoverlap, New J. Phys. , 113022 (2018).[18] Z.-C. Yang, A. Rahmani, A. Shabani, H. Neven, andC. Chamon, Optimizing Variational Quantum Algo-rithms Using Pontryagin’s Minimum Principle, Phys. Rev. X , 021027 (2017).[19] T. F¨osel, P. Tighineanu, T. Weiss, and F. Marquardt,Reinforcement Learning with Neural Networks for Quan-tum Feedback, Phys. Rev. X , 31084 (2018).[20] C. Lin, Y. Wang, G. Kolesov, and U. Kalabi´c, Applica-tion of Pontryagin’s minimum principle to Grover’s quan-tum search problem, Phys. Rev. A , 022327 (2019).[21] A. G. Rattew, S. Hu, M. Pistoia, R. Chen, andS. Wood, A Domain-agnostic, Noise-resistant, Hardware-efficient Evolutionary Variational Quantum Eigensolver,arXiv:1910.09694 (2019).[22] D. Chivilikhin, A. Samarin, V. Ulyantsev, I. Iorsh,A. R. Oganov, and O. Kyriienko, MoG-VQE: Mul-tiobjective genetic variational quantum eigensolver,arXiv:2007.04424 (2020).[23] L. Cincio, K. Rudinger, M. Sarovar, and P. J. 
Coles,Machine learning of noise-resilient quantum circuits,arXIv:2007.01210 (2020).[24] Q. Yao, M. Wang, Y. Chen, W. Dai, Y.-F. Li, W.-W. Tu,Q. Yang, and Y. Yu, Taking Human out of Learning Ap-plications: A Survey on Automated Machine Learning,arXiv:1810.13306 (2018).[25] S. Huang, X. Li, Z.-Q. Cheng, Z. Zhang, and A. Haupt-mann, in 2018 ACM Multimed. Conf. Multimed. Conf. -MM ’18 (ACM Press, New York, New York, USA, 2018)pp. 2049–2057.[26] K. O. Stanley, J. Clune, J. Lehman, and R. Miikku-lainen, Designing neural networks through neuroevolu-tion, Nat. Mach. Intell. , 24 (2019).[27] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu,J. Tan, Q. V. Le, and A. Kurakin, Large-scale evolutionof image classifiers, 34th Int. Conf. Mach. Learn. ICML2017 , 4429 (2017).[28] L. Xie and A. Yuille, Genetic CNN, Proc. IEEE Int. Conf.Comput. Vis. , 1388 (2017).[29] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, andK. Kavukcuoglu, Hierarchical Representations for Effi-cient Architecture Search, arXiv:1711.00436 (2017).[30] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, Regular-ized Evolution for Image Classifier Architecture Search,Proc. AAAI Conf. Artif. Intell. , 4780 (2019).[31] B. Zoph and Q. V. Le, Neural Architecture Search withReinforcement Learning, 5th Int. Conf. Learn. Represent.ICLR 2017 - Conf. Track Proc. (2016).[32] B. Baker, O. Gupta, N. Naik, and R. Raskar, Designingneural network architectures using reinforcement learn-ing, 5th Int. Conf. Learn. Represent. ICLR 2017 - Conf.Track Proc. , 1 (2017).[33] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang,Efficient architecture search by network transformation,32nd AAAI Conf. Artif. Intell. AAAI 2018 , 2787 (2018).[34] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, Learn-ing Transferable Architectures for Scalable Image Recog-nition, Proc. IEEE Comput. Soc. Conf. Comput. Vis.Pattern Recognit. , 8697 (2018).[35] M. Ostaszewski, E. Grant, and M. Benedetti, Quantumcircuit structure learning, arXiv:1905.09692 (2019). [36] H. 
R. Grimsley, S. E. Economou, E. Barnes, and N. J.Mayhall, An adaptive variational algorithm for exactmolecular simulations on a quantum computer, Nat.Commun. , 3007 (2019).[37] L. Li, M. Fan, M. Coram, P. Riley, and S. Leichenauer,Quantum optimization with a novel Gibbs objective func-tion and ansatz architecture search, Phys. Rev. Res. ,023074 (2020).[38] U. Las Heras, U. Alvarez-Rodriguez, E. Solano, andM. Sanz, Genetic Algorithms for Digital Quantum Sim-ulations, Phys. Rev. Lett. , 230504 (2016).[39] M. Y. Niu, S. Boixo, V. N. Smelyanskiy, and H. Neven,Universal quantum control through deep reinforcementlearning, npj Quantum Inf. , 33 (2019).[40] See Supplemental Materials for informations on: 1. Back-ground knowledge and related work on fields of DARTS,QAS and QAOA. 2. DQAS applications on state prepa-ration and unitary learning. 3. Relevant hyperparameterand ingredient settings on experiments in this work. 4.More results and comparisons on QEM for QFT circuit.5. More DQAS results for QAOA ansatz search includinginstance learning, block encoding, etc.[41] H. Liu, K. Simonyan, and Y. Yang, DARTS: Differen-tiable architecture search, 7th Int. Conf. Learn. Repre-sent. ICLR 2019 (2019).[42] S. Xie, H. Zheng, C. Liu, and L. Lin, SNAS: Stochasticneural architecture search, 7th Int. Conf. Learn. Repre-sent. ICLR 2019 (2019).[43] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang,and Z. Li, DARTS+: Improved Differentiable Archi-tecture Search with Early Stopping, arXiv:1909.06035(2019).[44] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox,and F. Hutter, Understanding and Robustifying Differ-entiable Architecture Search, ICLR 2019 (2019).[45] A. Hundt, V. Jain, and G. D. Hager, sharpDARTS:Faster and More Accurate Differentiable ArchitectureSearch, arXiv:1903.09900 (2019).[46] X. Chen, L. Xie, J. Wu, and Q. Tian, Progressive dif-ferentiable architecture search: Bridging the depth gapbetween search and evaluation, Proc. IEEE Int. Conf.Comput. Vis. 
, 1294 (2019).[47] F. P. Casale, J. Gordon, and N. Fusi, Probabilistic Neu-ral Architecture Search, arXiv:1902.05116 (2019).[48] P. K. Barkoutsos, G. Nannicini, A. Robert, I. Tavernelli,and S. Woerner, Improving Variational Quantum Opti-mization using CVaR, Quantum , 256 (2020).[49] Y. Nakata, C. Hirche, C. Morgan, and A. Winter, Uni-tary 2-designs from random X - and Z -diagonal unitaries,J. Math. Phys. , 052203 (2017).[50] G. E. Crooks, Gradients of parameterized quantum gatesusing the parameter-shift rule and gate decomposition,arXiv:1905.13311 (2019).[51] A. Harrow and J. Napp, Low-depth gradient measure-ments can improve convergence in variational hybridquantum-classical algorithms, arXiv:1901.05374 (2019).[52] S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih,Monte Carlo Gradient Estimation in Machine Learning,arXiv:1906.10652 (2019).[53] J. P. Kleijnen and R. Y. Rubinstein, Optimization andsensitivity analysis of computer simulation models by thescore function method, Eur. J. Oper. Res. , 413 (1996).[54] R. J. Williams, Simple statistical gradient-following al-gorithms for connectionist reinforcement learning, Mach. Learn. , 229 (1992).[55] D. P. Kingma and M. Welling, Auto-Encoding Varia-tional Bayes, Proc. 2nd Int. Conf. Learn. Represent.(2014).[56] J. Foerster, G. Farquhar, T. Rockt¨aschel, S. Whiteson,M. Al-Shedivat, and E. P. Xing, DICE: The infinitely dif-ferentiable Monte Carlo estimator, Proc. 35th Int. Conf.Mach. Learn. (2018).[57] S.-X. Zhang, Z.-Q. Wan, and H. Yao, AutomaticDifferentiable Monte Carlo: Theory and Application,arXiv:1911.09117 (2019).[58] G. E. Hinton, A practical guide to training restrictedboltzmann machines, Neural Networks: Tricks of theTrade , 599 (2012).[59] G. Carleo and M. Troyer, Solving the quantum many-body problem with artificial neural networks, Science , 602 (2017).[60] M. Germain, K. Gregor, I. Murray, and H. Larochelle,MADE: Masked autoencoder for distribution estimation,32nd Int. Conf. Mach. Learn. 
ICML 2015 , 881 (2015).[61] D. Wu, L. Wang, and P. Zhang, Solving StatisticalMechanics Using Variational Autoregressive Networks,Phys. Rev. Lett. , 080602 (2019).[62] O. Sharir, Y. Levine, N. Wies, G. Carleo, andA. Shashua, Deep autoregressive models for the efficientvariational simulation of many-body quantum systems,arXiv:1902.04057 (2019).[63] J.-G. Liu, L. Mao, P. Zhang, and L. Wang,Solving Quantum Statistical Mechanics with Varia-tional Autoregressive Networks and Quantum Circuits,arXiv:1912.11381 (2019).[64] Code implementation of DQAS and its applications canbe found at https://github.com/refraction-ray/tensorcircuit/tree/master/tensorcircuit/applications .[65] See https://github.com/quantumlib/Cirq .[66] See https://github.com/tensorflow/quantum .[67] See https://github.com/google/tensornetwork .[68] See https://github.com/refraction-ray/tensorcircuit .[69] J. J. Wallman and J. Emerson, Noise tailoring for scalablequantum computation via randomized compiling, Phys.Rev. A , 052325 (2016).[70] A. Zlokapa and A. Gheorghiu, A deep learning modelfor noise prediction on near-term quantum devices,arXiv:2005.10811 (2020).[71] F. G. S. L. Brandao, M. Broughton, E. Farhi, S. Gut-mann, and H. Neven, For Fixed Control Parameters theQuantum Approximate Optimization Algorithm’s Objec-tive Function Value Concentrates for Typical Instances,arXiv:1812.04170 (2018).[72] A. M. Childs and N. Wiebe, Hamiltonian simulation us-ing linear combinations of unitary operations, QuantumInf. Comput. , 901 (2012).[73] A. Skolik, J. R. McClean, M. Mohseni, P. van der Smagt,and M. Leib, Layerwise learning for quantum neural net-works, arXiv:2006.14904 (2020).[74] X. Chen, C.-J. Hsieh, P.-b. Regularization, and X. C.C.-j. Hsieh, Stabilizing Differentiable Architecture Searchvia Perturbation-based Regularization, Proc. 37th Int.Conf. Mach. Learn. Vienna, Austria, PMLR 119 (2020).[75] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb,E. Goodman, and W. 
Banzhaf, NSGA-Net: Neural Ar-chitecture Search using Multi-Objective Genetic Algo- rithm, arXiv:1810.03522 (2018).[76] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li,X. Chen, and X. Wang, A Comprehensive Survey ofNeural Architecture Search: Challenges and Solutions,arXiv:2006.02903 (2020).[77] X. Dong and Y. Yang, Searching for a robust neural ar-chitecture in four GPU hours, Proc. IEEE Comput. Soc.Conf. Comput. Vis. Pattern Recognit. , 1761 (2019).[78] Q. Yao, J. Xu, W.-W. Tu, and Z. Zhu, EfficientNeural Architecture Search via Proximal Iterations,arXiv:1905.13577 (2019).[79] A. Noy, N. Nayman, T. Ridnik, N. Zamir, S. Doveh,I. Friedman, R. Giryes, and L. Zelnik-Manor, ASAP: Ar-chitecture Search, Anneal and Prune, arXiv:1904.04123(2019).[80] H. Cai, L. Zhu, and S. Han, Proxyless Nas: Direct NeuralArchitecture Search on Target Task and Hardware, ICLR2019 (2019).[81] G. Li, G. Qian, I. C. Delgadillo, M. M¨uller, A. Thabet,and B. Ghanem, SGAS: Sequential Greedy ArchitectureSearch, arXiv:1912.00195 (2019).[82] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian,and H. Xiong, PC-DARTS: Partial Channel Connectionsfor Memory-Efficient Architecture Search, ICLR 2020(2019). [83] J. Chang, DATA : Differentiable ArchiTecture Approxi-mation, 33rd Conf. Neural Inf. Process. Syst. (2019).[84] S. Hu, S. Xie, H. Zheng, C. Liu, J. Shi, X. Liu, andD. Lin, DSNAS: Direct Neural Architecture Search with-out Parameter Retraining, arXiv:2002.09128 (2020).[85] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush,and H. Neven, Barren plateaus in quantum neural net-work training landscapes, Nat. Commun. , 4812 (2018).[86] E. Crosson, E. Farhi, C. Y.-Y. Lin, H.-H. Lin, andP. Shor, Different Strategies for Optimization Using theQuantum Adiabatic Algorithm, arXiv:1401.7320 (2014).[87] T. Albash, Role of nonstoquastic catalysts in quan-tum adiabatic optimization, Phys. Rev. A (2019),10.1103/PhysRevA.99.042334.[88] D. Sels and A. 
Polkovnikov, Minimizing irreversible losses in quantum systems by local counterdiabatic driving, Proc. Natl. Acad. Sci. U. S. A., E3909 (2017).
[89] A. Hartmann and W. Lechner, Rapid counter-diabatic sweeps in lattice gauge adiabatic quantum computing, New J. Phys., 043025 (2019).
[90] I. Beltagy, M. E. Peters, and A. Cohan, Longformer: The Long-Document Transformer, arXiv:2004.05150 (2020).
[91] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed, Big Bird: Transformers for Longer Sequences, arXiv:2007.14062 (2020).

SUPPLEMENTAL MATERIALS

A. Background and Related Work
Differentiable Neural Architecture Search.
NAS [76] is a burgeoning and active field in AutoML, whose ultimate goal is to automate the search for a top-performing neural network architecture for any given task. Popular approaches to implementing NAS include reinforcement learning [31], in which an RNN controller chooses an action for building the network structure layer by layer from a discrete set of options, and evolutionary or genetic algorithms [27, 28, 30], in which a population of network architectures is kept, evaluated, and mutated, keeping the fittest candidates. Such RL or evolutionary algorithms are rather resource intensive and time consuming, since the core task involves searching through an exponentially large space of discrete choices for the different elementary network components.

Recently, differentiable architecture search [41] and its variants have been proposed and have led to a surge in the number of related NAS studies [42–47, 74, 77–84]. Under the DARTS framework, the network architecture space of discrete components is relaxed into a continuous domain that facilitates search by differentiation and gradient descent. The relaxed search problem can be solved efficiently with noticeably reduced training time and hardware requirements.

In the original DARTS, the search space concerns the choice of distinct microstructures within one cell. Two types of cells are assumed for the networks: normal cells and reduction cells. NAS proceeds by first determining the microstructures within these two types of cells; a large network is then built by stacking these two cell types up to a variable depth with arbitrary input and output sizes. Within each cell, two inputs, four intermediate nodes, and one output (the concatenation of the four intermediate nodes) are represented as nodes in a directed acyclic graph. For each edge between nodes, one needs to determine the optimal connection layer, e.g., convolution with a certain kernel size, max/average pooling with a given window size, zero/identity connections, and so on.
To make such a search process differentiable, each edge is relaxed to the weighted sum of all the primitive operations from the pool, i.e. o(x) = \sum_i softmax(\alpha)_i \, o_i(x), where o_i stands for the i-th type of layer primitive and the continuous weights α are the structural parameters that determine the architecture of the neural network. Therefore, we have two sets of continuous parameters: the structural weights α, which determine the optimal network architecture by pruning in the evaluation stage, and the conventional neural network parameters ω. Via the DARTS setup, neural architecture search turns into a bi-level optimization problem where differentiation is carried out end-to-end.

DARTS requires the evaluation of the whole super network, where each edge is composed of all types of layers. This is memory intensive and limits its usage on large datasets or enriched cell structures. Therefore, there are works extending the DARTS idea while enabling forward evaluation on a sub network, usually using only one path [47, 84] or two [80]. Specifically, in [47], the authors viewed the super network as a probabilistic ensemble of subnetworks, so that the variational structural parameters enter NAS as parameters of a probabilistic model instead. One can then sample subnetworks from this probability distribution and evaluate one subnetwork at a time. This is feasible since the probabilistic model parameters can also be updated in a differentiable fashion from the general theory of gradients of Monte Carlo expectations [52].

There are additional follow-up works that focus on remedying drawbacks of DARTS with various training techniques. In general, these DARTS-related techniques are also illuminating and inspirational for the further DQAS developments in our work.

Related works on QAS.
Quantum architecture search, though no one branded it with this name before, is scattered across the literature in different contexts. These works are often specific to a problem setup and denoted as quantum circuit structure learning [35], adaptive variational algorithms [36], ansatz architecture search [37], evolutionary VQE [21], multi-objective genetic VQE [22], or noise-aware circuit learning [23]. The tasks they focus on are mainly in QAOA [37] or VQE [21, 22, 35, 36] settings. From a higher theoretical perspective, some quantum control works can also be classified as QAS tasks, where optimal quantum control protocols are explored using machine learning tools [19, 39].

These QAS-relevant works are closely related to NAS methodologies. This relevance is as expected, since quantum circuits and neural network structures share a great proportion of similarities. The mainstream approach to QAS is evolutionary/genetic algorithms with different variants of mutation, crossover, or tournament details [16, 17, 21–23, 38]. There are also works exploiting simple greedy/locality ideas [35–37] and reinforcement learning ideas [19, 39]. All of the QAS works mentioned above still search the quantum ansatz/architecture in a discrete domain, which increases the difficulty of the search and is in general time consuming. Due to the close relation between QAS and NAS, together with the great success of differentiable NAS ideas in machine learning, we here introduce a framework of differentiable QAS that enables end-to-end automatically differentiable QAS (DQAS). This new approach unlocks more possibilities than previous works, with less search time and more versatile capabilities. It is designed with the general QAS philosophy in mind, and the DQAS framework is hence universal for all types of circuit search tasks, instead of focusing on only one type of quantum computing task as in previous works.
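The continuous relaxation at the heart of the differentiable search described above, o(x) = \sum_i softmax(\alpha)_i \, o_i(x), can be sketched in a few lines; the primitive operations and structural weights below are toy stand-ins, not real network layers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# hypothetical primitive operations on one edge of a DARTS cell
primitives = [
    lambda x: x,                 # identity / skip connection
    lambda x: np.zeros_like(x),  # zero operation
    lambda x: np.maximum(x, 0),  # stand-in for a nonlinear layer
]

alpha = np.array([1.0, -1.0, 0.5])  # structural weights for this edge

def mixed_op(x):
    """DARTS relaxation: weighted sum of all primitives on the edge."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, primitives))

out = mixed_op(np.array([-1.0, 2.0]))
print(out)

# after the search, the edge is pruned to its most probable primitive
best_op = primitives[int(np.argmax(alpha))]
```

The gradient of a loss with respect to α flows through `mixed_op`, which is what makes the discrete architecture choice trainable; the probabilistic variant [47] instead samples one primitive per evaluation from softmax(α).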
Brief review on QAOA.
As introduced in [5], QAOA is designed to solve classical combinatorial optimization (CO) problems. These problems are often NP-complete, such as MAX CUT or MIS in graph theory. The basic idea is to prepare a variational quantum circuit by alternately applying two distinct Hamiltonian evolution blocks. Namely, a standard QAOA ansatz reads

|\psi\rangle = \prod_{j=0}^{P} \big( e^{i H_c \gamma_j} e^{i H_b \beta_j} \big) |\psi_0\rangle, (A1)

where |\psi_0\rangle should be prepared in the space of feasible solutions (preferably as an even superposition of all possible states; in the MAX CUT case |\psi_0\rangle = H^{\otimes n} |0^n\rangle, where n is the number of qubits and H^{\otimes n} denotes transversal Hadamard gates). In general, H_c is the objective Hamiltonian with H_c |\psi\rangle = f(\psi) |\psi\rangle, where f(\psi) is the CO objective. For the MAX CUT problem on a weighted graph with weight \omega_{ij} on edge ij, H_c = -\sum_{ij} \omega_{ij} Z_i Z_j up to an unimportant phase. (We use the notation X_i/Y_i/Z_i for Pauli operators on the i-th qubit throughout this work.) H_b is the mixer Hamiltonian that tunnels between different feasible solutions, with H_b = \sum_i X_i the most common choice when there is no restriction on the feasible Hilbert space.

The correctness of this ansatz is guaranteed when p approaches infinity, as it can then be viewed as quantum annealing (QA): we start from the ground state of the Hamiltonian H_b, namely |+^n\rangle, and evolve adiabatically to another Hamiltonian H_c; the final output state is then expected to be the ground state of H_c, which of course has the minimum energy/objective and thus solves the corresponding CO problem.

If we relax the strong restrictions of the QA limit and simply take QAOA as a form of variational ansatz, then there are four Hamiltonians instead of two defining the ansatz.

• H_p, the preparation Hamiltonian: the initial state is prepared from the zero product state to the ground state of H_p. In the original case, H_p is the same as H_b.

• H_b, the mixer Hamiltonian: responsible for making transitions between feasible states happen.

• H_p, the phase/problem Hamiltonian: time evolution under the phase Hamiltonian and the mixer Hamiltonian alternately makes up the bulk of the circuit; in the original QAOA, H_p is the same as H_c.
FIG. A1. (a) The minimal circuit for preparation of GHZ states automatically found by DQAS, where R y ( θ ) = e − i θ . (b)The circuit with p = 5 found by DQAS for Bell states transformation. • H c the cost Hamiltonian: the Hamiltonian used in objectives and measurements where (cid:104) ψ | H c | ψ (cid:105) is optimized.Moreover, such four Hamiltonian generalization of original QAOA can be further extended. For example, H b , H p are not necessarily the same Hamiltonian for each layer of the circuit. Nonetheless, the essence of such ansatz is that:the number of variational parameters is of order the same as layer number P which is much less than other variationalansatz of the same depth such as typical hardware efficient VQE or quantum neural network design. This fact rendersQAOA easier to train than VQE of the same depth and suffers less from barren plateus [85]. And as QAOA ansatzhas some reminiscent from QA, the final ansatz has better interpretation ability than typical random circuit ansatz.It is an interesting direction to search for the four definition Hamiltonians or even more general layouts beyond vanillaQAOA, to see whether there are similar quantum architectures that can outperform vanilla QAOA in CO problems,this is where DQAS plays a role.The physical intuition behind such QAOA type ansatz relaxation and searching originates from the close relationbetween QAOA and quantum adiabatic annealing. In particular, we draw inspirations from efforts to optimizeannealing paths and boost performance for quantum annealers. We reckon at least two fronts to search for betteransatz for the hybrid quantum-classical algorithm. The first case is to actually inspect the standard QAOA (whichtypically uses two alternating Hamiltonians to build the ansatz) and inquire if any ingredient may be improved. Forinstance, given the four Hamiltonians for the quantum-adiabatic inspired ansatz introduced in Sec. 
A, one may search for a better initial-state-preparation Hamiltonian, or find better mixer Hamiltonians than the plain ∑_i X_i for specific problems. Another inspiration derives from attempts to speed up quantum adiabatic annealing via ideas like catalyst Hamiltonians [86, 87], counterdiabatic Hamiltonians [88, 89], and other ideas in shortcuts to adiabaticity. These ultrafast annealing methods entail the design of complex annealing schedules that deviate from the simple linear schedule interpolating between an initial Hamiltonian and the target Hamiltonian. When these complex annealing paths are digitized and projected onto the quantum gate model with variational approximations, they may simply take the form of XX Hamiltonians or local Y Hamiltonians. With these extra Hamiltonians, catalyst or counterdiabatic, we anticipate that better performance may be achieved with shallower QAOA-like circuit layouts.

B. DQAS application in state preparation and circuit compiling
State preparation circuit.
We set out to design a quantum circuit for generating GHZ states |GHZ_n⟩ = (|0…0⟩ + |1…1⟩)/√2 from the initial state |0…0⟩. To find an optimal structure with less redundancy, we may progressively reduce the layer number p in DQAS until the objective can no longer be accomplished. In this case, the operation pool is composed of primitive gates such as single-qubit gates and CNOT gates on any pair of qubits. In principle, the availability of CNOT gates in the operation pool may be further subjected to the connectivity map of an actual hardware. The objective we choose for this problem is the final-state distance given by ∑_i |ψ_i − φ_i|, where |φ⟩ is the target GHZ state. Such a metric is easier to optimize than the typical fidelity or state-overlap objective ⟨ψ|φ⟩. The optimal circuit found by DQAS is shown in Fig. A1(a). It is interesting to observe that DQAS optimizes R_y(θ) by tuning θ to approximate the behavior of the Hadamard gate when the Hadamard gate is not given in the operation pool.

Unitary decomposition.
For the state-preparation example, we only care about how the circuit acts on the input state |0…0⟩ and ignore how other input states are transformed. This lack of consideration is evident in the objective chosen above. For the current example, we instead aim to decompose an arbitrary unitary operation into a set of primitive quantum gates, which implies that the transformation of all inputs is considered. For a concrete illustration, we use DQAS to design a quantum circuit for 2-qubit Bell state generation, which is useful for superdense coding. We need 2² = 4 independent input-output pairs to fully characterize the two-qubit unitary under investigation. For instance, the Bell state preparation circuit needs to conform to the following input/output relations: convert the inputs |00⟩ and |11⟩ to the Bell states (|00⟩ ± |11⟩)/√2, and convert the inputs |10⟩ and |01⟩ to (|01⟩ ± |10⟩)/√2, respectively.

input | output | ZZ | XX
|00⟩ | (|00⟩ + |11⟩)/√2 | +1 | +1
|01⟩ | (|01⟩ − |10⟩)/√2 | −1 | −1
|10⟩ | (|01⟩ + |10⟩)/√2 | −1 | +1
|11⟩ | (|00⟩ − |11⟩)/√2 | +1 | −1

TABLE I. Specification of a Bell circuit.

In the next example, we illustrate how to assemble this Bell-state preparation circuit with a finite set of quantum gates. In other words, we purposely restrict the operation pool to a finite number of discrete gates without any trainable parameters such as rotation angles. To apply DQAS to search for a Bell-state preparation circuit, we use the input/output relations in Table
I to build the objective function,

L = −⟨ZZ⟩_{U|00⟩} − ⟨XX⟩_{U|00⟩} + ⟨ZZ⟩_{U|01⟩} + ⟨XX⟩_{U|01⟩} + ⟨ZZ⟩_{U|10⟩} − ⟨XX⟩_{U|10⟩} − ⟨ZZ⟩_{U|11⟩} + ⟨XX⟩_{U|11⟩},   (A2)

where ⟨O⟩_{U|x⟩} = ⟨x|U†OU|x⟩ and U is the tentative circuit proposed by DQAS. The final (p = 5)-depth circuit obtained via DQAS is presented in Fig. A1(b).

C. Hyper parameter settings and training ingredients in experiments
We summarize some of the most important ingredients for DQAS below, and leave the extensive investigation of the effects of these and other adjustable ingredients to future work.

1. Ingredients of a common machine learning setup: optimizers, learning rates and schedules for both the trainable parameters θ and the structural parameters α of the probabilistic model. Since one epoch of evaluation for DQAS is more expensive than conventional neural network evaluations, we may need to find better learning schedules to boost training efficiency for DQAS.

2. Batch size: this factor plays a very important role in DQAS, since the score function estimator is in general of high variance. In practice, a batch size of O(100) shows good performance in circuit structure search.

3. Baselines: there is no theoretical guarantee that the running average of the objective is the best baseline to lower the variance of Monte Carlo estimations. Therefore, new baselines and even new methods to control the variance are worth exploring.

4. Encoding scheme: as we have seen in the examples, the encoding scheme of basic unitary blocks matters in QAS. Therefore, domain-specific and expressive encoding schemes beyond simple gate sets, such as the layer and block encodings discussed in the main text, are highly desirable for a broader set of applications.

5. Probabilistic model: the probabilistic model for DQAS can be made more sophisticated to better characterize the correlations between layers of circuits. Exploring energy-based models or autoregressive models is a promising future direction for DQAS.

6. Regularization terms: it is interesting to add other regularization and reward terms to the objectives in order to address multiple goals such as hardware restrictions and quantum noise reduction in the circuit design.

7.
Circuit parameter reusing mechanism: since the theoretical framework for DQAS is general and can easily go beyond DARTS, we may also explore novel parameter reusing mechanisms beyond the naïve ones based on the vanilla super network viewpoint.

The following hyperparameter settings are assumed unless explicitly stated otherwise:

• No prethermalization for circuit parameters θ.
• Optimizer for probabilistic model parameters α and circuit parameters θ: Adam optimizer with learning rate 0.
• Initializations: standard normal distribution for circuit parameters and all zeros for probabilistic parameters.
• Other techniques: no regularization terms or noise for circuit parameters by default.

Note that we did not carry out any extensive search for optimal hyperparameter settings. Hence, there is no guarantee that the hyperparameters listed below are optimal for the corresponding tasks.
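To make items 2 and 3 above concrete, the following toy numpy sketch implements the score-function (REINFORCE) gradient for the structural parameters α of an independent categorical model, with a running-average baseline. The two-slot search space and its loss table are hypothetical stand-ins for expensive circuit evaluations; they are not taken from the experiments in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy search space: p = 2 placeholders, each filled from a pool of 3 ops.
# The "objective" of a sampled structure is a fixed lookup table here,
# standing in for an expensive circuit evaluation (hypothetical numbers).
p, pool = 2, 3
scores = np.array([[0.9, 0.1, 0.5],
                   [0.4, 0.2, 0.8]])          # per-slot contribution to the loss

def objective(k):                              # k: chosen op index per slot
    return sum(scores[i, k[i]] for i in range(p))

alpha = np.zeros((p, pool))                    # structural parameters
baseline, lr, batch = 0.0, 0.1, 64

for epoch in range(200):
    probs = np.exp(alpha) / np.exp(alpha).sum(axis=1, keepdims=True)
    grad = np.zeros_like(alpha)
    losses = []
    for _ in range(batch):
        k = [rng.choice(pool, p=probs[i]) for i in range(p)]
        L = objective(k)
        losses.append(L)
        # score-function (REINFORCE) term: (L - baseline) * d log p / d alpha
        for i in range(p):
            g = -probs[i]                      # -p_j part of the log-prob gradient
            g[k[i]] += 1.0                     # indicator part for the sampled op
            grad[i] += (L - baseline) * g
    grad /= batch
    alpha -= lr * grad
    baseline = np.mean(losses)                 # running-average baseline

probs = np.exp(alpha) / np.exp(alpha).sum(axis=1, keepdims=True)
best = probs.argmax(axis=1)                    # greedy decode of the structure
```

With the loss table above, the greedy decode converges to the cheapest op in each slot; a larger batch visibly reduces the variance of `grad`, which is the point of items 2 and 3.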
State preparation circuit for GHZ. Primitive operation pool: parameterized R_y gates on qubits 0, 1, 2; CNOT on (0,1), (1,0), (1,2), (2,1), since we consider a circuit topology with nearest-neighbor connections only. Batch size: 128. Initializer for circuit parameters: zero initializer. Optimizer for α: Adam with learning rate 0.15.
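As a concrete check of the pattern in Fig. A1(a), the following minimal numpy sketch builds the three-qubit state with one y-rotation followed by a CNOT chain and evaluates the element-wise distance objective ∑_i |ψ_i − φ_i|. The convention R_y(θ) = e^{−iθY} and the angle θ = π/4 are our illustrative assumptions.

```python
import numpy as np

I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def ry(theta):
    # y-rotation exp(-i*theta*Y); with theta = pi/4 it mimics a Hadamard
    # up to phases, as DQAS does when H is absent from the pool
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# qubit 0 is the most significant bit of the 3-qubit state index
circuit = (np.kron(I2, CNOT)          # CNOT(1 -> 2)
           @ np.kron(CNOT, I2)        # CNOT(0 -> 1)
           @ np.kron(ry(np.pi / 4), np.kron(I2, I2)))

psi0 = np.zeros(8); psi0[0] = 1.0     # |000>
psi = circuit @ psi0

ghz = np.zeros(8); ghz[0] = ghz[7] = 1 / np.sqrt(2)
distance = np.sum(np.abs(psi - ghz))  # the distance objective from the text
```

The distance vanishes exactly here, illustrating why a tunable R_y can stand in for the missing Hadamard.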
Bell state circuit. Primitive operator pool: X, Y, H and CNOT gates on each qubit. Note that the whole set of operators in the pool is free of trainable parameters. Batch size: 128. Optimizer for α: Adam with learning rate 0.15.

QEM on QFT-3 circuit.
Primitive operation pool includes the discrete gates X, Y, Z, S, T, and I (the identity gate), where S = diag(1, i) and T = diag(1, e^{iπ/4}). The I gate must always be in the pool, as it stands for leaving a qubit idle. Batch size: 256.

The objective for this QEM task is to maximize the fidelity between the noisy output of the DQAS-designed quantum circuit and the ideal output. In principle, it should be evaluated over a batch of different input states for each circuit. However, as observed in numerical tests, the standard deviation of the fidelity between the noisy and ideal circuits across different input states is small. Therefore, following the spirit of stochastic gradient descent, the fidelity of a circuit is evaluated for only one random input state per epoch. Such a random input state is drawn approximately from the Haar measure, which can be partially achieved by a short-depth circuit known as a unitary 2-design [49]. We use 4 repeated blocks of a unitary 2-design as the input-state preparation circuit by default. Since we only want to evaluate the noise in the QFT circuit, we assume the preparation circuit for the random input states is noiseless.

QEM on QFT-4 circuit.
For the QFT-4 circuit in Fig. 3(a) in the main text, there are 12 slots in total; we therefore set p = 12 for this DQAS design. Since the search space is very large, the search tends to be trapped in local minima. Nevertheless, the designed (and potentially sub-optimal) circuits usually outperform the bare QFT-4 circuit in terms of fidelity. To reduce the number of restarts, we can restrict the search space by limiting the number of single-qubit gates in the pool. Knowledge of the relevant set of single-qubit gates for such a task can be learned from similar examples such as the QFT-3 case. This procedure, utilizing prior knowledge observed in smaller systems and pruning the set of possible operators, follows the philosophy of progressive training for DARTS [46] as well as the idea of transfer learning. In particular, in this study the operation pool contains the I placeholder, the Z gate, Z^{2/3}, Z^{1/8}, the S gate and the T gate. Note that there is another trick that may further increase the probability of finding highly nontrivial QEM circuits: excluding the I placeholder from the operation pool. Without the I placeholder, we force the DQAS engine to fill every gap in the circuit while attempting to maintain a high fidelity. In this way, we encourage DQAS to find nontrivial filling patterns containing long-range correlations of added gates in a QEM circuit, and we can recover the nontrivial QEM circuits more easily, as shown in the main text. Batch size: 256. Optimizer for α: Adam with learning rate 0.03 to 0.

Typical setups for QAOA ansatz searching.
We consider both small and large operation pools for the layer encoding. The small one includes the rO-layers with O = x, y, z, the zz-layer and the H-layer. The large one also includes the xx-layer and the yy-layer. We have also tested an extra-large operation pool with NNN-layers; namely, we also consider Hamiltonians of the ZZ, XX, and YY form that couple pairs of next-nearest neighbors on the underlying graph. We have successfully reproduced the QAOA-type layout with this extra-large pool. For the block encoding scheme, the operator pool includes the H-layer, rx-zz-block, zz-ry-block, zz-rx-block, zz-rz-block, xx-rz-block, yy-rx-block, and rx-rz-block. For example, the zz-rx-block represents the operation e^{iθ_m ∑_{⟨ij⟩} Z_i Z_j} e^{−iθ_n ∑_i X_i} on the circuit.

For the setup of reduced ansatz searching, the operation pool includes the H-layer and rO-layers as well as zz-layers, but now with different sets of edges. We often include 8 to 12 different subgraphs of the problem graph instance, each containing only a small fraction of the edges of the original graph. These subgraphs can be chosen randomly, and in general O(10) of them is enough to find ansatze better than plain QAOA. A caveat is that the number of such reduced-graph-based zz-layers may have to grow for graphs with more nodes, i.e., problems involving more qubits. More interestingly, the reduced graphs for these new zz-layers are not necessarily exact subgraphs of the problem under investigation: they can also have random edges with random weights, and the result found by DQAS can sometimes be as good as with the subgraph ansatz. An ablation study on the design of reduced graph instances is an interesting future direction.

We have tried various combinations of ingredients in QAOA ansatz searching. Some are of particular value, including: a large batch size, typically 64 to 512; noise on circuit parameters in simulation, typically independent zero-centered Gaussian noise with standard deviation 0.2 on each parameter in the pool; and different objectives, of which CVaR gives promising results apart from the conventional energy expectation objective. When CVaR is used, there is in general no need to add noise to the trainable parameters. Initializers for circuit parameters: a Gaussian initializer with narrow width and mean value around 0. Optimizer for α with learning rate 0.15 to 0.; circuit-parameter learning rate 0.005 to 0.05, depending on the batch size. We also tried L-BFGS and Nelder-Mead optimizers for circuit parameter updates and saw no obvious improvement in terms of final objective values. Fixing the header operator as transversal Hadamard gates boosts the training, but it is not necessary: the initialization with ⊗^n H can also be found automatically by DQAS itself. Penalty terms as in Eq. (13), with λ around 0.01, are particularly useful to avoid early attraction to the xx-layer and yy-layer (or one can simply drop the xx-layer and yy-layer from the beginning, as they are shown to be redundant in the main text).

Other techniques include those discussed in the main text: top-k grid search post-processing and multiple restarts may be necessary to find the optimal structure, as the energy landscape of such a search is rather complicated and many lower-p QAOA layouts serve as local minimum traps. For example, if we carry out DQAS for 5 layers, we often end up with an architecture equivalent to P = 1 QAOA instead of P = 2. The reason is that the performance improvement of the deeper QAOA layout is slight, and the global minimum is obscured by many local minima with similar objective values. The block encoding, with two-layer combinations as primitive operators in the pool, is much easier to train than the layer encoding scheme and mitigates this problem. On the contrary, novel objectives such as the Gibbs objective, which show a sharper energy landscape in circuit parameter space, are not suitable for DQAS, where bi-optimization dominates and a flatter energy landscape helps. It is also worth mentioning that, in the ensemble learning setup, the dataset of graph instances is not pre-determined. Instead, graph instances are generated on the fly during DQAS training, and in principle one can visit all graph instances of an ensemble as long as the number of search epochs is large enough. This design is better than training on a fixed dataset of an ensemble of graph instances, which tends to overfit.

The construction of QAOA primitive layers from a native quantum gate set is as follows: a single-qubit layer specified in the form e^{−iθ ∑_i O_i} is just a product of single-qubit rotation gates r_O. A layer of two-qubit gates is of the form e^{iθ ∑_{⟨ij⟩} O_i O_j} = ∏_{⟨ij⟩} e^{iθ O_i O_j}, and each term e^{iθ O_i O_j} can be implemented as in Fig. A2.

FIG. A2. The circuit construction for e^{−iθ Z⊗Z} from CNOT and R_z gates. For the implementation of e^{−iθ X⊗X} and e^{−iθ Y⊗Y}, the only change is to prepend and append Hadamard gates or R_x(±π/2) gates on both sides of each qubit line. This is the key building block for QAOA layers.
If O is not Z, then basis-rotation gates (Hadamard H for O = X, R_x(±π/2) for O = Y) are attached on both sides of the circuit. From this perspective, the xx-layer and yy-layer are actually redundant, since they can be exactly implemented as H-layer + zz-layer + H-layer or rx-layer + zz-layer + rx-layer, respectively. Therefore we can safely drop the xx-layer and yy-layer from the operation pool.

D. Further results on QEM of QFT circuit
Fidelity results for bare circuit, naïve QEM circuit and nontrivial QEM circuit by DQAS.
The noise model for the backend quantum simulation is described in the main text. The bit-flip errors are randomly inserted between adjacent circuit layers. Whenever there is an empty slot (idle state) in the circuit, the error rate is larger.
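The fidelity degradation caused by such errors can be illustrated with a minimal single-qubit density-matrix sketch. The toy circuit (H followed by T), the 5% bit-flip rate, and the use of generic random input states are illustrative assumptions, not the QFT noise model of the text; the small spread of the fidelity over random inputs is what motivates evaluating only one input state per epoch.

```python
import numpy as np

rng = np.random.default_rng(1)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1.0, np.exp(1j * np.pi / 4)])
U = T @ H                                   # a toy "ideal" circuit (assumption)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
px = 0.05                                   # assumed bit-flip probability

def haar_state(d):
    # a random pure state; the text uses a unitary 2-design circuit instead
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def noisy_fidelity(psi_in):
    ideal = U @ psi_in
    rho = np.outer(ideal, ideal.conj())
    rho = (1 - px) * rho + px * X @ rho @ X   # bit-flip channel after the circuit
    return float(np.real(ideal.conj() @ rho @ ideal))

fids = [noisy_fidelity(haar_state(2)) for _ in range(200)]
mean, std = np.mean(fids), np.std(fids)
```

Here every fidelity lies in [1 − px, 1], so the standard deviation across inputs is necessarily small.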
FIG. A3. Two human-designed gate-inserting policies for QEM on the QFT-4 circuit. (a) The naïve Pauli-pair insertion whenever possible. (b) The advanced insertion, which tries to collectively return the identity in each gap except for single holes. The DQAS-found QEM circuit outperforms these circuits in terms of fidelity.
FIG. A4. Some optimal QEM circuits found by DQAS. They show similar fidelity performance to the one given in the main text.
Next, we present a comparative study not only between the QEM circuits found by DQAS and the bare circuit, but also between the QEM circuits designed by DQAS and QEM circuits inspired by theoretical methods. For the QFT-3 circuit, the typical textbook circuit only gives a fidelity of 0.33 in the presence of the bit-flip errors; the naïve QEM circuit, with pairs of Pauli gates added on qubits 0 and 2, gives an ameliorated fidelity of 0.55 (different choices of Pauli gates only yield small differences in the fidelity). The QEM circuit found by DQAS gives a further improved fidelity. For the QFT-4 circuit, the bare circuit only gives a fidelity of 0.13. Again, the naïve QEM circuit, where as many pairs of Pauli gates as possible are added to fill empty slots (see Fig. A3(a)), gives a fidelity of 0.3. There is another type of circuit-filling policy (see Fig. A3(b)), where we do nothing to single empty slots in a circuit and insert into contiguous empty slots (spanning more than two layers) a series of Z^{α_i} gates that collectively give the identity. This strategy allows one to fill all empty slots in the circuit except the isolated ones restricted to a single circuit layer. The QEM circuit designed under this policy gives a fidelity around 0.41. On the other hand, DQAS discovers many distinct configurations of QEM circuits for the QFT-4 case, and the associated fidelities are usually found to be in a range from about 0.4 to 0.46. Clearly, the automated circuit designs outperform those recommended by the human-designed and sophisticated empty-slot-filling policies.
More QEM circuits with similar fidelity for QFT-4.
In the QFT-4 case, DQAS also finds various circuits with fidelity similar to the optimal one in the main text; we present some of them in Fig. A4.
E. Further results on QAOA ansatz searching
Multiple objectives with hardware consideration.
Additional considerations may be taken into account during the search for an ideal circuit design for the MAXCUT problem. Suppose that we still work with the layer-encoding operation pool given above, but with the xx-layer and yy-layer explicitly included. Furthermore, suppose the backend quantum hardware is equipped with the primitive gates rO, H, and CNOT, so that every circuit layer has to be translated into this native gate set. To design resource-efficient quantum circuits with DQAS, we may add the following penalty term to incorporate resource limitations and quantum error mitigation into the mix,

λ ∑_{i=1}^{p} ∑_{k ∈ c} P(k_i = k) ω(k),   (A3)

where ω(CNOT) = 2, ω(rO) = 1 and ω(H) = 1. Given the above costs for each gate, one can easily show that ω(xx-layer) = 27/

Block encoding for QAOA search.
When dealing with large-scale circuits, it may become progressively more challenging to discover an optimal structure for MAXCUT problems with DQAS. This is partially related to our decision to utilize the simplest probabilistic model, in which no explicit correlations between the choices of consecutive layers are taken into account. It might be useful to add such correlations into the design. For instance, in the standard QAOA circuit, an rx-layer is always followed by a zz-layer. Acknowledging the usefulness of such a composite block containing at least two primitive layers, we introduce the block encoding: the primitives in the operation pool are of the form e^{iθ_m ∑_{⟨ij⟩} Z_i Z_j} e^{−iθ_n ∑_i X_i}. In other words, the block encoding deals with various combinations of basic operations such as the zz-rx-block, yy-rz-block and so on. As before, it is useful to keep the Hadamard H-layer in the pool. Via this block encoding, we easily rediscover the standard QAOA layout for P = 3 as an H-layer followed by three zz-rx-blocks.

Proxy tasks and transfer learning.
We mainly apply DQAS to design QAOA-like ansatze for systems consisting of 8 or 10 qubits. We can think of these automated designs, conducted on small systems, as proxy tasks, since large systems (containing more qubits) often share the same optimal circuit patterns with small systems for the same family of problems such as MAXCUT. We can then fix the circuit architecture and only optimize the circuit parameters for larger systems, as in the standard QAOA algorithm.
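The "fix the architecture, re-optimize the parameters" step can be sketched in numpy for a toy instance: below we evaluate ⟨H_c⟩ for a fixed alternating zz/rx layout on a triangle MAXCUT graph. The graph, the depth P = 1, and the parameter values are illustrative assumptions only; in practice the transferred architecture comes from a small-system DQAS run.

```python
import numpy as np

edges = [(0, 1), (1, 2), (0, 2)]              # toy triangle graph (assumption)
n = 3
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def op_on(single, sites):
    """Tensor the given single-qubit operator onto `sites`, identity elsewhere."""
    mats = [single if q in sites else I2 for q in range(n)]
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

H_c = sum(op_on(Z, [i]) @ op_on(Z, [j]) for i, j in edges)   # cost Hamiltonian
H_b = sum(op_on(X, [i]) for i in range(n))                   # mixer Hamiltonian

def evolve(Hm, t, psi):
    # exp(-i*t*Hm) @ psi via eigendecomposition (Hm is Hermitian)
    w, v = np.linalg.eigh(Hm)
    return v @ (np.exp(-1j * t * w) * (v.conj().T @ psi))

def energy(gammas, betas):
    psi = np.ones(2 ** n) / np.sqrt(2 ** n)   # |+>^n initial state
    for g, b in zip(gammas, betas):
        psi = evolve(H_c, g, psi)             # zz (phase) layer
        psi = evolve(H_b, b, psi)             # rx (mixer) layer
    return float(np.real(psi.conj() @ H_c @ psi))

e0 = energy([0.0], [0.0])                     # no evolution: <+|H_c|+> = 0
e1 = energy([0.3], [0.4])                     # one fixed (gamma, beta) setting
```

Only `energy(gammas, betas)` would be re-optimized when transferring the fixed layout to a larger graph.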
Other competitive and near optimal layouts found by DQAS.
In a study of a set of random regular graphs, DQAS not only finds QAOA circuits but also discovers other circuit designs with comparable or slightly inferior performance. More specifically, via an ensemble of n = 8 regular graphs with degree 3, we found an alternative architecture comprising H-layer, yy-layer, rx-layer, zz-layer, rx-layer, which gives an expectation value for MAXCUT around 8.

Layerwise training.
It becomes increasingly challenging to find an optimal ansatz for deeper circuits, since there are multiple local minima in the rugged energy landscape of MAXCUT problems. One way to understand the challenge is to note that P-block QAOA circuits effectively contain (P − 1)-block QAOA circuits as special cases; when the P-block optimal circuit delivers only a marginal performance gain over the (P − 1)-block one, the search easily stalls at the lower-p layout. To resolve this challenge, the idea of layer-wise progressive training can be adopted [73]. Note that the multiple-start technique is necessary due to the hardness of finding optimal circuits when the circuit depth is large. Early stopping as well as top-2 grid search are also found to be useful.

Erdős–Rényi graph results.
In this study, we also try out DQAS designs for MAXCUT problems with graphs drawn from the Erdős–Rényi ensemble. More specifically, we consider two ensembles characterized, respectively, by n = 10, p = 0. and n = 8, p = 0.

Different objectives beyond an average value.
Since the aim of QAOA is to estimate the lowest energy instead of an average one, it seems natural to use objectives that more accurately reflect the overlap between the prepared quantum state and the ground state, rather than the average energy. To this end, there are some newly proposed objectives such as CVaR [48] and the Gibbs objective [37]. These objectives can easily be handled within the framework of DQAS in the form of Eq. (3) in the main text. For example, for the Gibbs objective, we simply set f(x) = e^{−λx} and g(x) = ln x. The Gibbs objective is supposed to reward prepared quantum states with a higher overlap with low-energy states. However, the Gibbs objective is found to introduce a very steep landscape with respect to the circuit parameters. It may be useful when the task only requires searching for optimal circuit parameters of a given circuit architecture. However, in a typical DQAS task we need to identify an optimal circuit design (i.e., simultaneously finding an architecture and the related circuit parameters), and the sharp landscape of the Gibbs objective presents a non-trivial challenge. In particular, the circuit parameters in DQAS-designed circuits frequently deviate from the optimal circuit parameters during the search due to the existence of the super-network structure, and the minimum of the Gibbs objective is hence blurred. On the contrary, CVaR seems to be a good objective to try with DQAS. CVaR measures the mean energy of only the proportion of samples (say 20%) having the lowest energies.

FIG. A5. One of the graph instances for instance learning of the MAXCUT problem. The MAXCUT value for this graph is 10. We can find a better quantum architecture than QAOA of the same depth to approximate the MAXCUT value for such instances.
This objective gives, by nature, a much smoother energy landscape than the Gibbs objective. In short, the Gibbs objective is not compatible with the DQAS framework, but the CVaR objective shows promising potential in our study. The reduced ansatz search on weighted graphs, mentioned in the main text and further described in the next section of this supporting information, is actually based on the CVaR objective. We leave a detailed comparison and ablation study of different choices of objectives to future work.
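Both objectives reduce to simple statistics of the sampled energies. A minimal sketch follows; the sample energies and the CVaR fraction q = 0.2 are made up for illustration.

```python
import numpy as np

def cvar(energies, q=0.2):
    """Mean of the lowest q-fraction of the sampled energies."""
    e = np.sort(np.asarray(energies))
    k = max(1, int(np.ceil(q * len(e))))
    return e[:k].mean()

def gibbs(energies, eta=1.0):
    """Gibbs objective -log <exp(-eta*E)>, estimated from samples."""
    e = np.asarray(energies)
    return -np.log(np.mean(np.exp(-eta * e)))

samples = np.array([-3.0, -1.0, -1.0, 0.0, 2.0])
c = cvar(samples, q=0.2)       # lowest 20% of five samples: only the -3.0 one
g = gibbs(samples, eta=1.0)
```

Both objectives lie below the plain mean energy, since each re-weights the low-energy tail; the exponential re-weighting in `gibbs` is what sharpens its landscape.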
Instance learning.
As discussed in the main text, we may design an optimal circuit for an individual problem (dubbed instance learning) or for an ensemble of problems. It is quite obvious that a highly customized circuit architecture, adapted to the specifics of a particular graph instance, may outperform a generic QAOA layout when the circuit depth is restricted. In this section, we report our study on using DQAS to design ideal circuits for individual instances of the MAXCUT problem. For example, DQAS recommends an optimal circuit composed of yy-layer, zz-layer and yy-layer that gives a lower expected energy than the vanilla QAOA circuit of the same depth.

Reduced graph ansatz searching on instance learning.
Finally, we elucidate the study of ansatz searching with reduced graphs introduced in the main text. We provide two examples. The first task is to design an optimal ansatz circuit for an unweighted graph (shown in Fig. A6), adopting the mean energy of the proposed circuits as the training objective. An optimal circuit found by DQAS for this specific graph is rx-layer, zz-layer, zz-layer, ry-layer and rx-layer. Recall that the zz-layers are generated from Hamiltonians with restricted connectivity sampled from the original graph, as explained in the main text. Figure A6 also gives the two subgraphs used to generate the ZZ Hamiltonians of the 2nd and 3rd layers in this optimal design. Finally, we quote the MAXCUT estimates of the optimal circuit and the vanilla QAOA for comparison. For this specific graph, the exact MAXCUT value is 12. The optimal reduced ansatz found by DQAS outperforms both the P = 1 and P = 2 vanilla QAOA layouts. The second example is to design circuits for a weighted graph with weights drawn from a Gaussian distribution centered at 1; the MAXCUT estimates given by the optimal reduced ansatz and the vanilla QAOA circuits are about 10 and 9, respectively.

FIG. A6. Reduced graph ansatz for the unweighted graph case (all weights are unity). (a) Base graph for MAXCUT and (b)(c) reduced graphs found by DQAS in the reduced ansatz layout, which outperforms P = 2 plain QAOA.

FIG. A7. Reduced graph ansatz for the weighted graph case. (a) Base graph for MAXCUT and (b)(c) reduced graphs found by DQAS in the reduced ansatz layout, which outperforms P = 1 plain QAOA.