Barren plateaus preclude learning scramblers
Zoë Holmes, Andrew Arrasmith, Bin Yan, Patrick J. Coles, Andreas Albrecht, Andrew T. Sornborger
Information Sciences, Los Alamos National Laboratory, Los Alamos, NM, USA.
Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA.
Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM, USA.
Center for Quantum Mathematics and Physics and Department of Physics and Astronomy, UC Davis, One Shields Ave, Davis, CA.
(Dated: October 1, 2020)
* The first three authors contributed equally to this work.

Scrambling, the rapid spread of entanglement and information through many-body quantum systems, is of fundamental importance to thermalization and quantum chaos, but can be challenging to investigate using standard techniques. Recently, quantum machine learning (QML) has emerged as a promising platform to study complex quantum processes. This prompts the question of whether QML could be used to investigate scrambling. In this Letter, we prove a no-go theorem constraining the use of QML to learn an unknown scrambling process. Specifically, we show that employing QML to learn an unknown scrambling unitary, for any choice of variational ansatz, will with high probability lead to a barren plateau landscape, i.e., one in which the cost gradient vanishes exponentially in the system size. We provide numerics extending our results to approximate scramblers. More broadly, our no-go theorem implies that prior information is required to learn an unknown unitary and thus places limits on the learnability of quantum processes.
Introduction.
The growth of entanglement in many-body quantum systems can rapidly distribute the information contained in the initial conditions of the system throughout a large number of degrees of freedom. This process is known as scrambling [1–3]. In recent years scrambling has proven central not only to understanding quantum chaos but also to the study of the dynamics of quantum information [4–6], thermalization phenomena [7, 8], the black hole information paradox [1, 9–11], holography [12, 13], random circuits [14–16], fluctuation relations [17, 18] and entropic uncertainty relations [19]. However, the complexity of strongly-interacting many-body quantum systems makes scrambling rather challenging to study analytically. Furthermore, experimental studies of scramblers are demanding given the difficulties of distinguishing scrambling from decoherence and other experimental imperfections [10, 20].
Quantum computers have recently been used as a testbed for the study of scramblers [20, 21]. A possible further use of quantum computing would be to investigate scrambling using quantum machine learning (QML) methods [22]. Here we define QML [23] as any method that optimizes a parameterized quantum circuit by minimizing a problem-specific cost function. This includes variational quantum algorithms (VQAs) [24–40], which are used for numerous applications. It has recently been shown that entanglement provides a resource to exponentially reduce the number of training states required to learn a quantum process [41]. Thus one might hope that QML could prove an effective tool to study quantum scrambling. For example, Figure 1 shows a schematic of how this could be done for the Hayden-Preskill thought experiment [1].
FIG. 1.
Learning a scrambling unitary.
Panel (a) shows the setup of the classic Hayden-Preskill thought experiment, where someone attempts to retrieve information thrown into a black hole (a scrambler). If the scrambling unitary is known, then the information can be retrieved. Panel (b) shows the process of attempting to learn the scrambling black hole unitary. This requires a time that is exponential in the number of quantum degrees of freedom (qubits) due to an exponentially vanishing cost gradient, see Letter below. This precludes the information retrieval shown in (a).
However, despite the high expectations placed on QML, there remain fundamental questions concerning its scalability and breadth of applicability. Of particular concern is the growing body of literature on the existence of barren plateaus, i.e., regions in parameter space where cost gradients vanish exponentially as the size of the system studied increases. This phenomenon, which severely limits the trainability of large-scale quantum neural networks, has been demonstrated in a number of proposed architectures and classes of cost function [42–46].
In this paper we present a no-go theorem for the use of QML to study quantum scrambling. Namely, we show that any QML approach used to learn the unitary dynamics implemented by a typical scrambler will exhibit a barren plateau and thus be untrainable in the absence of further prior knowledge.
In contrast to the barren plateau phenomenon established in Ref. [42], which is a consequence of the ansatz structure and parameter initialization strategy, our barren plateau result holds for any choice of ansatz and any initialization of parameters. Thus, previously proposed strategies to avoid barren plateaus [47–49], such as correlating parameters, do not address the issue raised in our work. Our result is conceptually distinct from previous barren plateau results, and additionally provides an alternative perspective on trainability issues in QML. Given the close connection between chaos and randomness, our no-go theorem also applies to learning random and pseudo-random unitaries. As such, our result implies that to efficiently learn an unknown unitary process using QML, prior information about that process is required. For this reason, our work constrains the use of QML in the study of complex, arbitrary physical processes.
Preliminaries.
To illustrate the machine learning task we consider, let us start by recalling the famous Hayden-Preskill thought experiment [1] (Fig. 1). Suppose Alice attempts to destroy a secret, encoded in a quantum state, by throwing it into Nature's fastest scrambler, a black hole. How safe is Alice's secret? Hayden and Preskill argued that if Bob knows the unitary dynamics, U, implemented by the black hole, and shares a maximally entangled state with the black hole, it is possible to decode Alice's secret by collecting a few additional photons emitted from the black hole. However, this prompts a second question: how might Bob learn the scrambling unitary in the first place? Here we investigate whether QML can be used to learn the scrambling unitary, U.
To address this, we first motivate our notion of a scrambler. A diagnostic for information scrambling that has attracted considerable recent attention is the out-of-time-ordered correlator (OTOC) [3, 13, 50], a four-point correlator with unusual time ordering,
f_OTOC ≡ ⟨X̃ Y X̃† Y†⟩ . (1)
Here X and Y are local operators on different subsystems, X̃ = U† X(0) U is the Heisenberg-evolved initial operator X(0), and the average is taken over an infinite temperature state ρ ∝ 𝟙.
For chaotic dynamics, this quantity decays rapidly and persists at a small value. This behavior can be made more transparent by noting that if X and Y are both Hermitian and unitary, then the OTOC can be written as
f_OTOC = 1 − (1/2) ⟨[X̃, Y][X̃, Y]†⟩ . (2)
Since X and Y act on different subsystems, their unevolved commutator vanishes. However, scrambling dynamics evolve these local operators into global ones, inducing a growth of the commutator.
In this work, we are interested in learning the evolution unitary at late times, when the dynamics become sufficiently complex for universal structures to form. For instance, the unitary U = exp(−iHt) of a chaotic Hamiltonian H appears increasingly random with time. Specifically, after a time scale called the scrambling time, the OTOC of a chaotic system tends to a minimal value that is equivalent to taking its average over a random distribution of unitaries [14].
The link between scrambling and randomness can be made more precise by introducing the concept of unitary designs. An ensemble of unitaries with distribution µ is a unitary k-design if its statistics agree with those of the Haar random distribution up to the k-th moment, i.e., if for any X,
∫_µ dU U^{⊗k} (X) U^{†⊗k} = ∫_Haar dU U^{⊗k} (X) U^{†⊗k} . (3)
Since f_OTOC in (1) only involves the second moment of the unitary, its asymptotic smallness can be attributed to the fact that the scrambling unitary appears to be a typical element of a 2-design [14]. Hence, a scrambler can be modelled as a unitary that is drawn from a distribution that forms at least a unitary 2-design [1, 2]. We remark that the dynamics of chaotic systems before reaching the scrambling time provide a model for approximate scrambling behavior [51], which we study in our numerics below.
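As a concrete illustration of the diagnostic in Eq. (1), the following minimal NumPy sketch compares the infinite-temperature OTOC of a Haar-random (scrambling) unitary with that of a non-entangling unitary built from single-qubit rotations. The choice of local operators (Pauli X on the first qubit and Pauli Z on the last), the system size, and the seeds are illustrative assumptions, not taken from the text.

```python
import numpy as np
from scipy.stats import unitary_group

n = 6                      # number of qubits (illustrative choice)
d = 2 ** n
I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])

def kron_all(ops):
    out = ops[0]
    for op in ops[1:]:
        out = np.kron(out, op)
    return out

X_first = kron_all([X] + [I2] * (n - 1))   # local operator on the first qubit
Z_last = kron_all([I2] * (n - 1) + [Z])    # local operator on the last qubit

def otoc(U):
    """Infinite-temperature OTOC of Eq. (1): f = Tr[X~ Y X~† Y†]/d with X~ = U† X U."""
    Xt = U.conj().T @ X_first @ U
    return (np.trace(Xt @ Z_last @ Xt.conj().T @ Z_last.conj().T) / d).real

U_scrambler = unitary_group.rvs(d, random_state=0)                           # typical 2-design element
U_local = kron_all([unitary_group.rvs(2, random_state=k) for k in range(n)]) # non-entangling circuit

print(otoc(U_local))      # ~1: the evolved operator stays local, so the commutator in Eq. (2) stays zero
print(otoc(U_scrambler))  # ~0: the operator has spread, so the commutator has grown
```

For the non-entangling circuit the commutator in Eq. (2) remains zero and f_OTOC = 1, while a Haar-random unitary gives a value suppressed by roughly 1/d, consistent with the 2-design picture above.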
Main results.
Having formalized our notion of scrambling, we are now in a position to present our main result on the learnability of scramblers. The aim of QML is to minimize a problem-specific cost function that is evaluated on a quantum computer. In the context of learning unitaries, one considers an ansatz (i.e., a parameterized quantum circuit) U(θ) and a target unitary V. It is then natural to define the product W(θ) = V† U(θ), which would be proportional to the identity in the case of perfect training, i.e., when U(θ) matches the target V. To quantify the quality of the training, one can employ a generic cost function of the form
C(θ, V) = ⟨ψ| W(θ)† H W(θ) |ψ⟩ , (4)
where |ψ⟩ is some state and H is some Hermitian operator. While in the main text we focus on the cost in (4), in Appendix B we extend our results to a more general cost of the form
C_gen(θ, V) = Σ_i p_i ⟨ψ_i| (W(θ)† ⊗ 𝟙_R) H_i (W(θ) ⊗ 𝟙_R) |ψ_i⟩ . (5)
This cost allows for multiple input states {|ψ_i⟩} and measurement operators {H_i}, thereby allowing for QML approaches that use training data. Moreover, the training data states |ψ_i⟩ now act not just on the scrambling system S but can also be entangled with a reference system R to reduce resource requirements [41]. We remark that standard cost functions for variational compiling of unitaries [28, 32] fall under the framework of our general cost in (5), as discussed in Appendix D.
Barren plateaus have been proven [42] for a wide class of variational quantum algorithms, including those that aim to learn unitaries, whenever the training ansatz U(θ) is sufficiently random. This result is a consequence of the random nature of both the ansatz structure and the initialization of parameters, and it makes no reference to the form of a potential target unitary to be learned. Here, we prove a complementary result. Namely, if the target unitary V is drawn from a sufficiently random ensemble (i.e., an ensemble that forms at least a 2-design), one also encounters a barren plateau, irrespective of the training ansatz or the initialization of parameters. Thus, regardless of the QML strategy that is employed, a typical scrambler will manifest a barren plateau.
Suppose one wants to learn an unknown target unitary V where all that is known is that it is drawn from an ensemble of scramblers 𝕍, which corresponds to 𝕍 forming a 2-design as noted above. Consider learning V by variationally minimizing a cost C(θ, V) of the general form in (4). The following proposition, which we prove in Appendix A of the Supplementary Material (SM), establishes that the average gradient of the cost is zero.
Proposition 1.
The average partial derivative of C(θ, V), with respect to any parameter θ_k, for an ensemble of target unitaries 𝕍 that form a 2-design, is given by
⟨∂_{θ_k} C(θ, V)⟩_𝕍 = 0 . (6)
Proposition 1 establishes that the gradient is unbiased, but this alone does not preclude the possibility of large variations in the gradient, and thus is insufficient to assess trainability. However, Chebyshev's inequality bounds the probability that the partial derivative of the cost function deviates from its mean value,
P(|∂_{θ_k} C| ≥ |x|) ≤ Var_𝕍[∂_{θ_k} C] / x² , (7)
in terms of the variance of the cost partial derivative for a typical target unitary,
Var_𝕍[∂_{θ_k} C] = ⟨(∂_{θ_k} C(θ, V))²⟩_𝕍 − ⟨∂_{θ_k} C(θ, V)⟩_𝕍² . (8)
As a result, a vanishingly small Var_𝕍[∂_{θ_k} C] (combined with a vanishing average gradient) for all θ_k would imply that the probability that the cost partial derivative is non-zero is vanishingly small for all parameters, i.e., the cost landscape forms a barren plateau.
Indeed, this behavior is precisely what we find here. As shown in Appendix A of the SM, we prove the following.
Theorem 2.
Consider a generic cost function C(θ, V), Eq. (4), to learn an n-qubit target unitary V using an arbitrary ansatz U(θ). The variance of the partial derivative of C(θ, V), with respect to any parameter θ_k, for an ensemble of target unitaries 𝕍 that form a 2-design, is given by
Var_𝕍[∂_{θ_k} C] = [ 2 Tr[H²] / (2^{2n} − 1) − 2 (Tr[H])² / (2^{n}(2^{2n} − 1)) ] Var_χ[−iU ∂_{θ_k} U†] , (9)
where Var_𝕍 denotes the variance over the ensemble 𝕍, and Var_χ denotes the quantum-mechanical variance with respect to the ansatz-evolved state |χ(θ)⟩ = U(θ)|ψ⟩.
From Theorem 2, we derive the following corollary on the scaling of Var_𝕍[∂_{θ_k} C].
Corollary 3.
Consider a generic cost function C(θ, V), Eq. (4), to learn an n-qubit target unitary V. Without loss of generality, the ansatz can be written in the form
U(θ) = ∏_{i=1}^{N} U_i(θ_i) W_i , (10)
where {W_i} is a chosen set of fixed unitaries and U_i(θ_i) = exp(−iθ_i G_i) with G_i a Hermitian operator. If Tr[H²] ∈ O(2^n) and ‖G_k‖_∞ ∈ O(1), then
Var_𝕍[∂_{θ_k} C] ∈ O(2^{−n}) . (11)
We note that for practical cost functions the condition Tr[H²] ∈ O(2^n) holds. Similarly, standard ansatzes use normalized generators and therefore it is reasonable to assume that ‖G_k‖_∞ ∈ O(1) for all k. As such, we conclude that the variance in the gradient will in general vanish exponentially with n, the number of qubits in the system. We note that Appendix B extends this exponential scaling result to the generalized cost function in (5). Hence, these results establish that QML approaches to learn a typical target scrambling unitary, that is, a unitary drawn from a 2-design, will exhibit a barren plateau.
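The scaling of Theorem 2 and Corollary 3 can be checked directly by Monte-Carlo sampling over Haar-random targets. The sketch below uses a deliberately simple single-parameter ansatz U(θ) = exp(−iθ X₁) with |ψ⟩ = |0…0⟩ and H = Z₁ (a Pauli Z on the first qubit, so that Tr[H] = 0 and Tr[H²] = 2ⁿ); these choices, and the sample counts, are illustrative assumptions rather than the setup used in the paper's numerics.

```python
import numpy as np
from scipy.linalg import expm
from scipy.stats import unitary_group

def grad_variance(n, n_samples=2000, theta=0.7, seed=1):
    """Estimate Var_V[dC/dtheta] over Haar-random targets V for the cost of Eq. (4),
    and compare with the prediction of Eq. (9)."""
    d = 2 ** n
    psi = np.zeros(d); psi[0] = 1.0
    X1 = np.kron(np.array([[0., 1.], [1., 0.]]), np.eye(d // 2))   # ansatz generator G
    Z1 = np.kron(np.array([[1., 0.], [0., -1.]]), np.eye(d // 2))  # measurement operator H
    chi = expm(-1j * theta * X1) @ psi                             # |chi> = U(theta)|psi>
    np.random.seed(seed)
    grads = []
    for _ in range(n_samples):
        V = unitary_group.rvs(d)
        B = V @ Z1 @ V.conj().T                                    # V H V^dagger
        # dC/dtheta = i <chi| [G, V H V^dagger] |chi>
        grads.append((1j * (chi.conj() @ (X1 @ B - B @ X1) @ chi)).real)
    var_G = (chi.conj() @ X1 @ X1 @ chi).real - (chi.conj() @ X1 @ chi).real ** 2
    trH2, trH = float(d), 0.0
    prediction = (2 * trH2 / (d**2 - 1) - 2 * trH**2 / (d * (d**2 - 1))) * var_G
    return np.var(grads), prediction

for n in range(2, 7):
    print(n, *grad_variance(n))   # both columns shrink roughly as 2^{-n}, cf. Corollary 3
```

With this choice of H the sampled variance tracks the 2^{−n} suppression of Corollary 3; replacing H by a rank-one projector (Tr[H²] = 1) makes the decay even faster, as the bracket in Eq. (9) then shrinks as 2^{−2n}.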
Numerical implementation for approximate scramblers.
Here we extend our results by numerically studying approximate scramblers. For concreteness, we now take a dynamical perspective and model a scrambler using a variant of the minimal model introduced in Ref. [51], where the scrambling unitary
V_S(g, t) := (V₂(g) V₁)^t (12)
consists of alternating random gates and entangling layers. Here our time parameter t is effectively the circuit depth. Specifically, the first layer is composed of a series of random single-qubit rotations
V₁ = ∏_i R^i_x(θ^i_x) R^i_y(θ^i_y) R^i_z(θ^i_z) , (13)
where R^i_k(θ^i_k) is a rotation about the k = x, y, z axis of the i-th qubit and the {θ^i_k} are randomly chosen angles between 0 and 2π. The second layer, V₂(g), is a global entangling gate built from commuting entangling gates generated by the Pauli operators Z_k on the k-th qubit, acting across all n qubits of the system, with an overall strength set by the entangling rate g. The degree to which V_S(g, t) is scrambling increases over time t, with the rate of increase determined by the entangling rate g.
We consider learning a scrambler modelled by V_S(g, t) using an ansatz of precisely the same structure. That is, we generate a target scrambler V_S^target by randomly generating a set of single-qubit rotation angles θ^target and attempt to learn the angles using an ansatz of the form U(θ) = V_S(g, t). For concreteness, we suppose the local Hilbert-Schmidt cost [28], which we detail in Appendix D of the SM, is used to learn V_S.

FIG. 2. Landscape of the cost function around the optimum parameters. Here we plot a random cut of the landscape of the LHST cost function C_LHST(U, V) (defined in Appendix D), where V(g, t) is a randomly generated scrambler modelled via Eq. (12) and U(g, t) is an ansatz of the same form. The parameter ε is a noise parameter which determines the extent to which the ansatz parameters θ deviate from the parameters θ^target of the scrambler being learned. Specifically, we set θ_k = θ_k^target + εR, where R is a random number between −1 and 1. The landscapes for weak scramblers, with t = 1 and the two values of the entangling rate g considered, are plotted in yellow and red respectively. The landscapes for stronger scramblers, with t = 15 and the same two values of g, are plotted in green and blue respectively. In all four cases we consider a 9-qubit scrambler (n = 9).

In Fig. 2 we plot a cross-section of the cost landscape in the region around the true parameters θ^target that minimize the cost. As we increase the degree to which the target unitary is scrambling, by increasing the duration of evolution t and the strength of the entangling gates g, the variance in the cost visibly decreases. In the case of a highly scrambling unitary (blue) the majority of the landscape forms a barren plateau, with only a narrow gorge where the cost dips down to its minimum. In contrast, for weaker scramblers (yellow) the valley around the minimum is wider and the plateau more featured. This is a nice visual representation of how the barrenness of the cost landscape depends on the degree to which the target unitary is scrambling.
In Fig. 3 we plot the variance of the cost partial derivative as a function of scrambling time t (left) and system size n (right) for varying entangling rate g. For completeness and for comparison, the variance is calculated both over an ensemble 𝕍 of target unitaries (top), denoted Var_𝕍[∂_{θ_k} C], and over an ensemble 𝕌 of random parameterized ansatzes (bottom), denoted Var_𝕌[∂_{θ_k} C]. For sufficiently large t and g the target unitary is a perfect scrambler, and the variance of the partial derivative vanishes exponentially as Var_𝕍[∂_{θ_k} C] ∝ 2^{−2n}.
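The scrambler family of Eq. (12) is straightforward to assemble numerically. The sketch below builds V_S(g, t) from a layer of random single-qubit rotations, Eq. (13), alternated with a global entangling layer; the specific all-to-all ZZ form (with a 1/n normalization) used here for V₂(g) is an assumption made for illustration, since the explicit expression is not reproduced above.

```python
import numpy as np
from scipy.linalg import expm

def scrambler(n, g, t, rng=None):
    """Sketch of V_S(g, t) = (V2(g) V1)^t, Eq. (12): a fixed layer V1 of random
    single-qubit rotations (Eq. (13)) alternated t times with a global entangling
    layer V2(g). The all-to-all ZZ form of V2 is an illustrative assumption. n >= 2."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = 2 ** n
    X = np.array([[0., 1.], [1., 0.]])
    Y = np.array([[0., -1j], [1j, 0.]])
    Z = np.array([[1., 0.], [0., -1.]])

    def on_qubit(op, i):
        return np.kron(np.kron(np.eye(2 ** i), op), np.eye(2 ** (n - i - 1)))

    # V1: random rotations about x, y and z on every qubit, Eq. (13)
    V1 = np.eye(d, dtype=complex)
    for i in range(n):
        tx, ty, tz = rng.uniform(0.0, 2 * np.pi, 3)
        Ri = expm(-1j * tx * on_qubit(X, i) / 2) \
             @ expm(-1j * ty * on_qubit(Y, i) / 2) \
             @ expm(-1j * tz * on_qubit(Z, i) / 2)
        V1 = Ri @ V1

    # V2(g): commuting ZZ interactions between all pairs, strength g/n (assumed form)
    H_zz = sum(on_qubit(Z, i) @ on_qubit(Z, j)
               for i in range(n) for j in range(i + 1, n))
    V2 = expm(-1j * (g / n) * H_zz)

    V = np.eye(d, dtype=complex)
    for _ in range(t):                 # t alternating layers = effective circuit depth
        V = V2 @ V1 @ V
    return V
```

Feeding such a V_S(g, t) into the OTOC routine above shows f_OTOC dropping towards its 2-design value as t and g grow, while a copy of the same circuit with the rotation angles promoted to variational parameters plays the role of the ansatz U(θ) used for Figs. 2 and 3.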
We note that this 2^{−2n} decay in fact exceeds the minimal scaling predicted by Corollary 3, which is expected for typical scrambling ansatzes as we discuss in Appendix C. Once the scrambling time is sufficient for perfect scrambling, Var_𝕍[∂_{θ_k} C] saturates. For weaker scramblers Var_𝕍[∂_{θ_k} C] similarly decreases with system size but at a slower rate. The same behavior is seen for Var_𝕌[∂_{θ_k} C], demonstrating a duality between averaging over targets and averaging over ansatzes.
For finite-size studies, inevitably, the simulated target ensemble is not an ideal unitary design. Nevertheless, our numerical data obey Corollary 3. In Appendix E of the SM, we demonstrate that the exponential suppression is preserved as long as the target ensemble is sufficiently close to a unitary 2-design.

FIG. 3. Variance of the gradient of the cost for different degrees of scrambling and scrambler sizes. Here we plot the variance in the gradient of a single local term of the LHST cost function C(U, V) (defined in Appendix D), where V(g, t) is a scrambler modelled via Eq. (12) for two values of the entangling rate, the larger being g = 1 (solid), and U(g, t) is an ansatz of the same form. In the first (second) row the variance is calculated over an ensemble of random target (ansatz) unitaries. In the first (second) column the variance in the gradient is plotted as a function of t (n), for system sizes n = 3, 5, 7, 9, 11 and times t = 3, 6, 9, 12, 15. As indicated by the dotted line, for large g and t the variance in the gradient vanishes exponentially as Var[∂_{θ_k} C] ∝ 2^{−2n}.

Discussion.
Bob's ability to decode Alice's secret from observing relatively few emitted photons in the Hayden-Preskill thought experiment relies on the scrambling nature of the black hole's unitary dynamics. This ensures that the information contained in Alice's state is quickly distributed across the black hole. Our work establishes that, intriguingly, while essential for the decoding protocol, the scrambling nature of the black hole inhibits Bob's ability to learn U in the first place. Therefore, perhaps Alice's secret is safer than previously thought.
More generally, our result entails that, irrespective of the choice of ansatz, a barren plateau will be encountered when learning a typical random (or pseudo-random) unitary. Thus the no-go theorem provides a fundamental limit on the learnability of unknown processes. Thankfully, most physically interesting processes are sufficiently simple or structured that they do not resemble a typical random (or pseudo-random) unitary. For example, the short-time evolution of even chaotic systems forms only a 1-design and consequently, as supported by our numerical results, may be learnable. Therefore the no-go theorem does not condemn QML but rather highlights the importance of understanding its domain of applicability.
Crucially, Bob's situation in the Hayden-Preskill thought experiment differs from other machine learning tasks. Elsewhere, he might be able to learn more information about the target unitary, and thereby devise an approach which would avoid a barren plateau. But when attempting to learn the dynamics of a black hole, Bob only knows that he needs to learn a scrambler and does not have a way to peek behind the event horizon to the black hole formation time, so this additional information is inaccessible.
Even when prior information on the target unitary is available, how best to use this information remains to be established.
For example, it could be fruitful to explore iterative approaches that adapt the ansatz based on partial information extracted about the target. Alternatively, for certain applications, one could explore approaches where the target is iteratively learned by breaking the evolution down into shorter, more tractable, time steps. It may further be worth investigating whether barren plateaus can be avoided if a target unitary need only be partially rather than fully learned. More generally, while prior results [42–44] highlight the need to design clever ansatzes, our results highlight the need to consider the properties of the target unitary to be learned, and to carefully select the target accordingly.

ACKNOWLEDGMENTS

We thank Marco Cerezo for helpful discussions. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of High Energy Physics QuantISED program under Contract Nos. DE-AC52-06NA25396 and KA2401032 (ZH, AA, PJC, AA, ATS). BY acknowledges support of the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, Condensed Matter Theory Program, and partial support from the Center for Nonlinear Studies. PJC and ATS acknowledge initial support from the Los Alamos National Laboratory (LANL) ASC Beyond Moore's Law project.

[1] Patrick Hayden and John Preskill, "Black holes as mirrors: quantum information in random subsystems," Journal of High Energy Physics, 120–120 (2007).
[2] Yasuhiro Sekino and L. Susskind, "Fast scramblers," Journal of High Energy Physics, 065–065 (2008).
[3] Juan Maldacena, Stephen H. Shenker, and Douglas Stanford, "A bound on chaos," Journal of High Energy Physics, 106 (2016).
[4] Pavan Hosur, Xiao-Liang Qi, Daniel A. Roberts, and Beni Yoshida, "Chaos in quantum channels," Journal of High Energy Physics, 4 (2016).
[5] Adam Nahum, Jonathan Ruhman, Sagar Vijay, and Jeongwan Haah, "Quantum entanglement growth under random unitary dynamics," Phys. Rev. X, 031016 (2017).
[6] Winton Brown and Omar Fawzi, "Decoupling with random quantum circuits," Communications in Mathematical Physics, 867–900 (2015).
[7] R. J. Lewis-Swan, A. Safavi-Naini, J. J. Bollinger, and A. M. Rey, "Unifying scrambling, thermalization and entanglement through measurement of fidelity out-of-time-order correlators in the Dicke model," Nature Communications, 1581 (2019).
[8] Markus J. Klug, Mathias S. Scheurer, and Jörg Schmalian, "Hierarchy of information scrambling, thermalization, and hydrodynamic flow in graphene," Phys. Rev. B, 045102 (2018).
[9] Beni Yoshida and Alexei Kitaev, "Efficient decoding for the Hayden-Preskill protocol," arXiv e-prints, arXiv:1710.03363 (2017), arXiv:1710.03363 [hep-th].
[10] Beni Yoshida and Norman Y. Yao, "Disentangling scrambling and decoherence via quantum teleportation," Phys. Rev. X, 011006 (2019).
[11] K. A. Landsman, C. Figgatt, T. Schuster, N. M. Linke, B. Yoshida, N. Y. Yao, and C. Monroe, "Verified quantum information scrambling," Nature, 61–65 (2019).
[12] Stephen H. Shenker and Douglas Stanford, "Black holes and the butterfly effect," Journal of High Energy Physics, 67 (2014).
[13] A. Kitaev, "A simple model of quantum holography," (2015), proceedings of the KITP Program: Entanglement in Strongly-Correlated Quantum Matter (Kavli Institute for Theoretical Physics, Santa Barbara).
[14] Daniel A.
Roberts and Beni Yoshida, “Chaos and com-plexity by design,” Journal of High Energy Physics ,121 (2017).[15] Jordan Cotler, Nicholas Hunter-Jones, Junyu Liu, andBeni Yoshida, “Chaos, complexity, and random matri-ces,” Journal of High Energy Physics , 48 (2017).[16] Bruno Bertini and Lorenzo Piroli, “Scrambling in ran-dom unitary circuits: Exact results,” Phys. Rev. B ,064305 (2020).[17] A. Chenu, I. L. Egusquiza, J. Molina-Vilaplana, andA. del Campo, “Quantum work statistics, loschmidt echoand information scrambling,” Scientific Reports , 12634(2018).[18] Nicole Yunger Halpern, “Jarzynski-like equality for theout-of-time-ordered correlator,” Phys. Rev. A , 012120(2017).[19] Nicole Yunger Halpern, Anthony Bartolotta, and Ja-son Pollack, “Entropic uncertainty relations for quantuminformation scrambling,” Communications Physics , 92(2019).[20] K. A. Landsman, C. Figgatt, T. Schuster, N. M. Linke,B. Yoshida, N. Y. Yao, and C. Monroe, “Verified quan-tum information scrambling,” Nature , 61–65 (2019).[21] Bin Yan and Nikolai A Sinitsyn, “Recovery of dam-aged information and the out-of-time-ordered correla-tors,” Physical Review Letters , 040605 (2020).[22] John Preskill, “Quantum Computing in the NISQ eraand beyond,” Quantum , 79 (2018).[23] Jacob Biamonte, Peter Wittek, Nicola Pancotti, PatrickRebentrost, Nathan Wiebe, and Seth Lloyd, “Quantummachine learning,” Nature , 195–202 (2017).[24] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q.Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien,“A variational eigenvalue solver on a photonic quantumprocessor,” Nature Communications , 4213 (2014).[25] Jarrod R McClean, Jonathan Romero, Ryan Babbush,and Al´an Aspuru-Guzik, “The theory of variationalhybrid quantum-classical algorithms,” New Journal ofPhysics , 023023 (2016).[26] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann,“A Quantum Approximate Optimization Algorithm,”arXiv e-prints , arXiv:1411.4028 (2014), arXiv:1411.4028[quant-ph].[27] J. Romero, J. P. Olson, and A. Aspuru-Guzik, “Quan-tum autoencoders for efficient compression of quan-tum data,” Quantum Science and Technology , 045001(2017).[28] Sumeet Khatri, Ryan LaRose, Alexander Poremba,Lukasz Cincio, Andrew T Sornborger, and Patrick JColes, “Quantum-assisted quantum compiling,” Quan-tum , 140 (2019).[29] R. LaRose, A. Tikku, ´E. O’Neel-Judy, L. Cincio, andP. J. Coles, “Variational quantum state diagonalization,”npj Quantum Information , 1–10 (2018).[30] Andrew Arrasmith, Lukasz Cincio, Andrew T Sorn-borger, Wojciech H Zurek, and Patrick J Coles, “Vari- ational consistent histories as a hybrid algorithm forquantum foundations,” Nature communications , 1–7(2019).[31] Marco Cerezo, Alexander Poremba, Lukasz Cincio, andPatrick J Coles, “Variational quantum fidelity estima-tion,” Quantum , 248 (2020).[32] Kunal Sharma, Sumeet Khatri, Marco Cerezo, andPatrick J Coles, “Noise resilience of variational quantumcompiling,” New Journal of Physics , 043006 (2020).[33] Carlos Bravo-Prieto, Ryan LaRose, M. Cerezo, YigitSubasi, Lukasz Cincio, and Patrick J. 
Coles, “Variationalquantum linear solver: A hybrid algorithm for linear sys-tems,” arXiv:1909.05820 (2019).[34] M Cerezo, Kunal Sharma, Andrew Arrasmith, andPatrick J Coles, “Variational quantum state eigensolver,”arXiv preprint arXiv:2004.01372 (2020).[35] Kentaro Heya, Ken M Nakanishi, Kosuke Mitarai, andKeisuke Fujii, “Subspace variational quantum simula-tor,” arXiv preprint arXiv:1904.08566 (2019).[36] Cristina Cirstoiu, Zoe Holmes, Joseph Iosue, Lukasz Cin-cio, Patrick J Coles, and Andrew Sornborger, “Vari-ational fast forwarding for quantum simulation beyondthe coherence time,” npj Quantum Information , 1–10(2020).[37] Benjamin Commeau, Marco Cerezo, Zo¨e Holmes, LukaszCincio, Patrick J Coles, and Andrew Sornborger,“Variational hamiltonian diagonalization for dynamicalquantum simulation,” arXiv preprint arXiv:2009.02559(2020).[38] Ying Li and Simon C Benjamin, “Efficient variationalquantum simulator incorporating active error minimiza-tion,” Physical Review X , 021050 (2017).[39] Suguru Endo, Ying Li, Simon Benjamin, and Xiao Yuan,“Variational quantum simulation of general processes,”arXiv preprint arXiv:1812.08778 (2018).[40] Xiao Yuan, Suguru Endo, Qi Zhao, Ying Li, and Si-mon C Benjamin, “Theory of variational quantum simu-lation,” Quantum , 191 (2019).[41] Kunal Sharma, M. Cerezo, Zo¨e Holmes, Lukasz Cin-cio, Andrew Sornborger, and Patrick J. Coles, “Refor-mulation of the No-Free-Lunch Theorem for EntangledData Sets,” arXiv e-prints , arXiv:2007.04900 (2020),arXiv:2007.04900 [quant-ph].[42] Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy,Ryan Babbush, and Hartmut Neven, “Barren plateausin quantum neural network training landscapes,” NatureCommunications , 4812 (2018).[43] M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio,and Patrick J. Coles, “Cost-Function-Dependent BarrenPlateaus in Shallow Quantum Neural Networks,” arXive-prints , arXiv:2001.00550 (2020), arXiv:2001.00550[quant-ph].[44] Kunal Sharma, M. Cerezo, Lukasz Cincio, andPatrick J. Coles, “Trainability of Dissipative Perceptron-Based Quantum Neural Networks,” arXiv e-prints ,arXiv:2005.12458 (2020), arXiv:2005.12458 [quant-ph].[45] Samson Wang, Enrico Fontana, M Cerezo, KunalSharma, Akira Sone, Lukasz Cincio, and Patrick J Coles,“Noise-induced barren plateaus in variational quantumalgorithms,” arXiv preprint arXiv:2007.14384 (2020).[46] M Cerezo and Patrick J Coles, “Impact of barren plateauson the hessian and higher order derivatives,” arXivpreprint arXiv:2008.07454 (2020).[47] Edward Grant, Leonard Wossnig, Mateusz Ostaszewski, and Marcello Benedetti, “An initialization strategy foraddressing barren plateaus in parametrized quantum cir-cuits,” Quantum , 214 (2019).[48] Guillaume Verdon, Michael Broughton, Jarrod R Mc-Clean, Kevin J Sung, Ryan Babbush, Zhang Jiang, Hart-mut Neven, and Masoud Mohseni, “Learning to learnwith quantum neural networks via classical neural net-works,” arXiv preprint arXiv:1907.05415 (2019).[49] Tyler Volkoff and Patrick J. Coles, “Large gradi-ents via correlation in random parameterized quan-tum circuits,” arXiv e-prints , arXiv:2005.12200 (2020),arXiv:2005.12200 [quant-ph].[50] A. Larkin and Y. N. Ovchinnikov, “Quasiclassicalmethod in the theory of superconductivity,” Sov. Phys.JETP , 1200 (1969).[51] Ron Belyansky, Przemyslaw Bienias, Yaroslav A.Kharkov, Alexey V. 
Gorshkov, and Brian Swingle, "A minimal model for fast scrambling," arXiv e-prints, arXiv:2005.05362 (2020), arXiv:2005.05362 [quant-ph].
[52] Bin Yan, Lukasz Cincio, and Wojciech H. Zurek, "Information scrambling and Loschmidt echo," Physical Review Letters, 160603 (2020).
[53] Fernando G. S. L. Brandão, Aram W. Harrow, and Michał Horodecki, "Local random quantum circuits are approximate polynomial-designs," Communications in Mathematical Physics, 397–434 (2016).

Supplementary Material for "Barren plateaus preclude learning scramblers"

Appendix A: Proofs of results presented in the main text

1. Proof of Proposition 1

Proof. We start by switching the order of the average and the derivative,
⟨∂_{θ_k} C(θ, V)⟩_𝕍 = ∫dV ∂_{θ_k} C(θ, V) = ∂_{θ_k} ∫dV Tr[U†(θ) V H V† U(θ) ρ] , (A1)
where ρ = |ψ⟩⟨ψ|. The Haar integral over V can be evaluated using the identity
∫dV Tr[V A V† B] = Tr[A] Tr[B] / d , (A2)
where d = 2^n is the dimension of V. Thus we are left with
⟨∂_{θ_k} C(θ, V)⟩_𝕍 = (Tr[H]/2^n) ∂_{θ_k} Tr[ρ] = 0 , (A3)
as claimed.

2. Proof of Theorem 2

Proof. The variance of the cost function
C(θ, V) = ⟨ψ| U†(θ) V H V† U(θ) |ψ⟩ (A4)
is given by
Var_𝕍[∂_{θ_k} C] ≡ ∫dV (∂_{θ_k} C)² = ∫dV ( Tr[ ∂(U†(θ) V H V† U(θ) ρ)/∂θ_k ] )² , (A5)
where ρ = |ψ⟩⟨ψ|. Note that we have used the result of Proposition 1 to neglect the squared mean term. This integral can be evaluated using the identity
∫dV Tr[V† A V B V† D V E] = ( Tr[A] Tr[D] Tr[BE] + Tr[AD] Tr[B] Tr[E] ) / (d² − 1) − ( Tr[A] Tr[D] Tr[B] Tr[E] + Tr[AD] Tr[BE] ) / (d(d² − 1)) , (A6)
to obtain
Var_𝕍[∂_{θ_k} C] = [ 2 Tr[H²] / (2^{2n} − 1) − 2 (Tr[H])² / (2^n(2^{2n} − 1)) ] Var_χ[−iU ∂_{θ_k} U†] , (A7)
where the variance on the RHS is evaluated with respect to the state |χ⟩ = U|ψ⟩.

3. Proof of Corollary 3

Proof. The term Var_χ[−iU ∂_{θ_k} U†] can be evaluated by considering a layered, parameterized circuit structure of the form
U(θ) = ∏_{i=1}^{N} U_i(θ_i) W_i . (A8)
Here {W_i} is a chosen set of fixed unitaries and U_i(θ_i) = exp(−iθ_i G_i), where G_i is a Hermitian operator. Next consider a bipartite cut made at the k-th layer of this circuit structure and write
U(θ) ≡ U_L^k(θ) U_R^k(θ) , (A9)
where
U_L^k(θ) = ∏_{i=k+1}^{N} U_i(θ_i) W_i and U_R^k(θ) = ∏_{i=1}^{k} U_i(θ_i) W_i . (A10)
The term Var_χ[−iU ∂_{θ_k} U†] evaluates to
Var_χ[−iU ∂_{θ_k} U†] = Var_{χ_R^k}[G_k] , (A11)
where Var_{χ_R^k}[G_k] is the variance of G_k with respect to the state |χ_R^k⟩ = U_R^k |ψ⟩. Thus we have that the variance in the cost is given by
Var_𝕍[∂_{θ_k} C] = [ 2 Tr[H²] / (2^{2n} − 1) − 2 (Tr[H])² / (2^n(2^{2n} − 1)) ] Var_{χ_R^k}[G_k] . (A12)
We further note that Var_{χ_R^k}[G_k] can be bounded as follows:
Var_{χ_R^k}[G_k] = ⟨χ_R^k| G_k² |χ_R^k⟩ − ⟨χ_R^k| G_k |χ_R^k⟩² ≤ ⟨χ_R^k| G_k² |χ_R^k⟩ ≤ ‖G_k‖_∞² , (A13)
where ‖X‖_∞ denotes the infinity norm of X, i.e., its largest eigenvalue. Therefore, if ‖G_k‖_∞ ∈ O(1), it follows that Var_{χ_R^k}[G_k] ∈ O(1). Assuming additionally that Tr[H²] ∈ O(2^n), from which it follows that (Tr[H])² ∈ O(2^{2n}), gives
Var_𝕍[∂_{θ_k} C] ∈ O(2^{−n}) , (A14)
as claimed.
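The Haar-integral identities (A2) and (A6) that drive these proofs are easy to spot-check by Monte-Carlo sampling; the sketch below does so for a small dimension with arbitrary fixed matrices (the dimension, matrices, and sample count are illustrative choices).

```python
import numpy as np
from scipy.stats import unitary_group

d, n_samples = 4, 20000
rng = np.random.default_rng(0)
A, B, D, E = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)) for _ in range(4))

# Monte-Carlo estimates of the first- and second-moment Haar integrals
first = second = 0.0
np.random.seed(1)
for _ in range(n_samples):
    V = unitary_group.rvs(d)
    Vd = V.conj().T
    first += np.trace(V @ A @ Vd @ B) / n_samples
    second += np.trace(Vd @ A @ V @ B @ Vd @ D @ V @ E) / n_samples

tr = np.trace
rhs_A2 = tr(A) * tr(B) / d                                                            # Eq. (A2)
rhs_A6 = (tr(A) * tr(D) * tr(B @ E) + tr(A @ D) * tr(B) * tr(E)) / (d**2 - 1) \
       - (tr(A) * tr(D) * tr(B) * tr(E) + tr(A @ D) * tr(B @ E)) / (d * (d**2 - 1))   # Eq. (A6)

print(abs(first - rhs_A2), abs(second - rhs_A6))  # both differences shrink as ~1/sqrt(n_samples)
```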
Appendix B: Extension of main results to the generalized cost C_gen

In this section we extend our results on the exponential suppression of the cost function gradient to a generalized cost function of the form
C_gen(θ, V) = Σ_i p_i ⟨ψ_i| (U†(θ)V ⊗ 𝟙_R) H_i (V†U(θ) ⊗ 𝟙_R) |ψ_i⟩ . (B1)
The weighted sum with Σ_i p_i = 1 accounts for multiple training data points. While U†(θ)V acts on a 2^n-dimensional system S, the training data {H_i, |ψ_i⟩} can additionally be entangled with an arbitrarily sized reference system R. Such entangled training data was proven beneficial for quantum machine learning in Ref. [41]. In analogy to the results shown in the main text, we have proven the following propositions and theorems on the cost landscape of C_gen(θ, V).

Proposition 4. The average partial derivative of the general cost C_gen(θ, V), with respect to any parameter θ_k, for an ensemble of target unitaries 𝕍 that form a 2-design, is given by
⟨∂_{θ_k} C_gen(θ, V)⟩_𝕍 = 0 . (B2)

Theorem 5. Consider the general cost function C_gen(θ, V) to learn an n-qubit target unitary V. Without loss of generality, the ansatz can be written in the form U(θ) = ∏_{i=1}^{N} U_i(θ_i) W_i, where {W_i} is a chosen set of fixed unitaries and U_i(θ_i) = exp(−iθ_i G_i) with G_i a Hermitian operator. Let us decompose the measurement operators as H_i = Σ_j q_{ij} H^S_{ij} ⊗ H^R_{ij}, with Σ_j q_{ij} = 1, where S and R label the corresponding Hilbert subspaces in the tensor product U†V ⊗ 𝟙_R. Let w_{ijk} denote the eigenvalues of H^R_{ij}. If Tr[H^S_i H^S_j] ∈ O(2^n), Tr[H^S_i] ∈ O(2^n), w_{ijk} ∈ O(1) and ‖G_k‖_∞ ∈ O(1), then
Var_𝕍[∂_{θ_k} C_gen] ∈ O(2^{−n}) . (B3)

1. Proof of Proposition 4

Proof. Switching the order of the average and the derivative, the averaged gradient reads
⟨∂_{θ_k} C_gen(θ, V)⟩_𝕍 = ∫dV ∂_{θ_k} C_gen(θ, V) = ∂_{θ_k} Tr[ Σ_i p_i ∫dV (U†(θ)V ⊗ 𝟙_R) H_i (V†U(θ) ⊗ 𝟙_R) ρ_i ] , (B4)
where ρ_i = |ψ_i⟩⟨ψ_i|. Each integral term can be performed using the identity for the Haar average over subsystems [52],
∫dV (V ⊗ 𝟙_R) A (V† ⊗ 𝟙_R) B = ( 𝟙_S ⊗ Tr_S[A]/d_S ) B . (B5)
Here Tr_S is a partial trace over the system S that V acts on, with dimension d_S = 2^n. This gives
∫dV (U†(θ)V ⊗ 𝟙_R) H_i (V†U(θ) ⊗ 𝟙_R) ρ_i = ( U†(θ)U(θ) ⊗ Tr_S[H_i]/2^n ) ρ_i = ( 𝟙_S ⊗ Tr_S[H_i]/2^n ) ρ_i . (B6)
As such, the integral is θ-independent and therefore the averaged gradient vanishes.

2. Proof of Theorem 5

Proof. Without loss of generality, each measurement operator H_i can be decomposed as
H_i = Σ_j q_{ij} H^S_{ij} ⊗ H^R_{ij} , Σ_j q_{ij} = 1 , (B7)
into operators acting on the system S and the reference R respectively. With this decomposition, the cost (B1) can be recast into the form
C_gen(θ, V) = Σ_{ij} p_i q_{ij} ⟨ψ_{ij}| (U†(θ)V ⊗ 𝟙_R)(H^S_{ij} ⊗ H^R_{ij})(V†U(θ) ⊗ 𝟙_R) |ψ_{ij}⟩ , (B8)
where we define |ψ_{ij}⟩ = |ψ_i⟩ for all j.
Now, for each index pair i, j, we can further decompose the corresponding input states as
|ψ_{ij}⟩ = Σ_k α_{ijk} |ψ^S_{ijk}⟩ ⊗ |ψ^R_{ijk}⟩ , Σ_k |α_{ijk}|² = 1 . (B9)
Here, for each pair {i, j}, we also have the freedom to choose {|ψ^R_{ijk}⟩}_k as a set of orthogonal eigenstates of H^R_{ij}. Note that this may not be a Schmidt decomposition and {|ψ^S_{ijk}⟩}_k in general are not orthogonal.
This allows us to fully factorize the cost (B8), i.e.,
C_gen(θ, V) = Σ_{ijk} p_i q_{ij} |α_{ijk}|² ⟨ψ^S_{ijk}| U†(θ) V H^S_{ij} V† U(θ) |ψ^S_{ijk}⟩ ⟨ψ^R_{ijk}| H^R_{ij} |ψ^R_{ijk}⟩ . (B10)
Since {|ψ^R_{ijk}⟩}_k is chosen as a set of eigenstates of H^R_{ij}, the quantities ⟨ψ^R_{ijk}| H^R_{ij} |ψ^R_{ijk}⟩ ≡ w_{ijk} are the eigenvalues of H^R_{ij}. We now relabel the triple {i, j, k} as {l}, and define w_l = w_{ijk}, H^S_l = H^S_{ijk} = H^S_{ij} for all k, and p̃_l = p_i q_{ij} |α_{ijk}|², with Σ_l p̃_l = 1. On doing so, the generalized cost takes the form
C_gen(θ, V) = Σ_l p̃_l w_l ⟨ψ^S_l| U†(θ) V H^S_l V† U(θ) |ψ^S_l⟩ . (B11)
The gradient of this cost reads
∂_{θ_k} C_gen(θ, V) = Σ_l p̃_l w_l Tr[ V H^S_l V† ∂(U(θ) ρ^S_l U†(θ))/∂θ_k ] , (B12)
where ρ^S_l = |ψ^S_l⟩⟨ψ^S_l|. This gives the form of its variance,
Var_𝕍[∂_{θ_k} C_gen(θ, V)] = ∫dV (∂_{θ_k} C_gen)² = Σ_{lm} p̃_l w_l p̃_m w_m ∫dV Tr[ V H^S_l V† ∂(U(θ) ρ^S_l U†(θ))/∂θ_k ] Tr[ V H^S_m V† ∂(U(θ) ρ^S_m U†(θ))/∂θ_k ] = Σ_{lm} p̃_l w_l p̃_m w_m S^k_{lm} . (B13)
Here we have used the result of Proposition 4 to neglect the squared mean term. Using the identity presented in Eq. (A6), each integral term S^k_{lm} in the above summation can be evaluated similarly as
S^k_{lm} = [ Tr[H^S_l H^S_m] / (2^{2n} − 1) − Tr[H^S_l] Tr[H^S_m] / (2^n(2^{2n} − 1)) ] [ Tr[(−iU ∂_{θ_k} U†)² χ^S_{lm}] Tr[χ^S_{lm}] − ( Tr[(−iU ∂_{θ_k} U†) χ^S_{lm}] )² ] , (B14)
where χ^S_{lm} = U |ψ^S_l⟩⟨ψ^S_m| U†. Denote J_k = −iU ∂_{θ_k} U†. The second factor in the above expression can be written in the compact form
Var_{χ^S_{lm}}[J_k] ≡ Tr[χ^S_{lm} J_k²] Tr[χ^S_{lm}] − ( Tr[χ^S_{lm} J_k] )² , (B15)
the variance of J_k with respect to χ^S_{lm}. It can be seen that for the terms with l = m it reduces to expression (A7). Thus, the variance of the gradient takes the form
Var_𝕍[∂_{θ_k} C_gen(θ, V)] = Σ_{lm} p̃_l w_l p̃_m w_m [ Tr[H^S_l H^S_m] / (2^{2n} − 1) − Tr[H^S_l] Tr[H^S_m] / (2^n(2^{2n} − 1)) ] Var_{χ^S_{lm}}[J_k] . (B16)
The proof is completed by evaluating the asymptotic scaling of Eq. (B16). We begin by noting that
Var_𝕍[∂_{θ_k} C_gen(θ, V)] = | Σ_{lm} p̃_l w_l p̃_m w_m [ Tr[H^S_l H^S_m] / (2^{2n} − 1) − Tr[H^S_l] Tr[H^S_m] / (2^n(2^{2n} − 1)) ] Var_{χ^S_{lm}}[J_k] |
≤ Σ_{lm} p̃_l p̃_m |w_l| |w_m| | Tr[H^S_l H^S_m] / (2^{2n} − 1) − Tr[H^S_l] Tr[H^S_m] / (2^n(2^{2n} − 1)) | Var_{χ^S_{lm}}[J_k]
≤ Σ_{lm} p̃_l p̃_m |w_l| |w_m| [ |Tr[H^S_l H^S_m]| / (2^{2n} − 1) + |Tr[H^S_l]| |Tr[H^S_m]| / (2^n(2^{2n} − 1)) ] Var_{χ^S_{lm}}[J_k] . (B17)
Then, defining w_max = max_l {|w_l|}, X_max = max_{lm} {|Tr[H^S_l H^S_m]|}, and Y_max = max_{lm} {|Tr[H^S_l]| |Tr[H^S_m]|}, we have that
Var_𝕍[∂_{θ_k} C_gen(θ, V)] ≤ w_max² [ X_max / (2^{2n} − 1) + Y_max / (2^n(2^{2n} − 1)) ] Σ_{lm} p̃_l p̃_m Var_{χ^S_{lm}}[J_k] . (B18)
As before, we consider a layered, parameterized circuit structure of the form U(θ) = ∏_{i=1}^{N} U_i(θ_i) W_i, where {W_i} is a chosen set of fixed unitaries, U_i(θ_i) = exp(−iθ_i G_i) and G_i is a Hermitian operator. Following an argument analogous to that in Sec. A 3, we find that Var_{χ^S_{lm}}[J_k] ≤ ‖G_k‖_∞². Defining G_max = max_k {‖G_k‖_∞}, we have that
Var_𝕍[∂_{θ_k} C_gen(θ, V)] ≤ w_max² [ X_max / (2^{2n} − 1) + Y_max / (2^n(2^{2n} − 1)) ] G_max² , (B19)
where we use Σ_l p̃_l = 1. If Tr[H^S_l H^S_m] ∈ O(2^n) and Tr[H^S_l] ∈ O(2^n), it follows that X_max ∈ O(2^n) and Y_max ∈ O(2^{2n}). Therefore, additionally assuming that w_l = w_{ijk} ∈ O(1) and ‖G_k‖_∞ ∈ O(1), we find that
Var_𝕍[∂_{θ_k} C_gen] ∈ O(2^{−n}) , (B20)
as claimed.

Appendix C: Expected Scaling for Typical Ansatzes

We argue that under some practical circumstances the cost function gradient can be suppressed further. First note that the summation in the variance involves a large (exponential in n) number of states {|ψ_i⟩}, which in general are not orthogonal to each other. The typical values of the overlaps |⟨ψ_i|ψ_j⟩|² scale as 2^{−n}. A typical unitary ansatz also appears random, which allows us to estimate the typical values of each term in Var_{χ^S_{ij}}[J_k], i.e.,
∫dU |Tr[χ^S_{ij} J_k]|² = ∫dU Tr[ U† J_k U |ψ_j⟩⟨ψ_j| U† J_k U |ψ_i⟩⟨ψ_i| ] = [ (Tr[J_k])² |⟨ψ_i|ψ_j⟩|² + Tr[J_k²] ] / (2^{2n} − 1) − [ (Tr[J_k])² + Tr[J_k²] |⟨ψ_i|ψ_j⟩|² ] / (2^n(2^{2n} − 1)) ∼ 2^{−n} , (C1)
and
∫dU Tr[χ^S_{ij} J_k] Tr[χ^S_{ij}] = |⟨ψ_j|ψ_i⟩|² Tr[J_k] / 2^n ∼ 2^{−n} for i ≠ j . (C2)
Therefore, a ∼ 2^{−2n} scaling may commonly be observed in the variance of the cost function gradient.

Appendix D: Cost function to learn an unknown unitary

A natural choice of cost function to learn an unknown unitary V can be formulated in terms of the Hilbert-Schmidt inner product between V and a trainable unitary U as follows:
C_HST(U, V) = 1 − |Tr[U V†]|² / 2^{2n} . (D1)
In Ref. [28] this cost was demonstrated to have the following desirable properties.
1. It is faithful, vanishing iff U and V agree up to a global phase φ, that is, iff U = V exp(−iφ).
2. It is operationally meaningful, since it can be related to the average gate fidelity between U and V.
3. It can be efficiently computed on a quantum computer.
To see the latter, note that C_HST(U, V) can be written as
C_HST(U, V) = 1 − ⟨Φ⁺| (U V† ⊗ 𝟙_R) |Φ⁺⟩⟨Φ⁺| (V U† ⊗ 𝟙_R) |Φ⁺⟩ , (D2)
where |Φ⁺⟩ is a maximally entangled state across two registers each containing n qubits. Thus the cost takes the form of Eq. (5) in the main text with ρ_HST = |Φ⁺⟩⟨Φ⁺| and H_HST = 𝟙 − |Φ⁺⟩⟨Φ⁺|. That is, the computer is prepared in a maximally entangled state across the two registers, the first register is evolved under U V†, and then finally a Bell-state measurement is implemented across the two registers. The circuit to perform this protocol, and thereby measure C_HST(U, V), is known as the Hilbert-Schmidt Test and is shown in Fig. D1(a). We further note that the ricochet property (X† ⊗ 𝟙)|Φ⁺⟩ = (𝟙 ⊗ X*)|Φ⁺⟩ for any operator X implies that (U V† ⊗ 𝟙)|Φ⁺⟩ = (U ⊗ V*)|Φ⁺⟩. This is used in Fig.
D1(a) to apply U and V in parallel and thereby reduce the depth of the cost function circuit.
While the form of C_HST(U, V) is intuitive, for larger systems it exhibits barren plateaus even for short-depth ansatzes [43]. For this reason, in the numerical implementations performed in this paper we use a 'local' variant of the Hilbert-Schmidt Test. The local cost, C_LHST, can be written in the form of Eq. (5) with ρ_LHST = |Φ⁺⟩⟨Φ⁺| (similarly to C_HST), but now with H composed of a sum of projectors onto the local Bell states |φ⁺⟩ on the qubits S_j and R_j. Specifically, let us consider two n-qubit registers S and R and let S_j (R_j) represent the j-th qubit from the S (R) register. The C_LHST cost is of the form C_LHST = (1/n) Σ_{j=1}^{n} C^{(j)}_LHST, where
C^{(j)}_LHST = ⟨Φ⁺| (U V† ⊗ 𝟙_R) H^{(j)}_LHST (V U† ⊗ 𝟙_R) |Φ⁺⟩ (D3)
and
H^{(j)}_LHST = 𝟙_{SR} − 𝟙_{S̄_j} ⊗ |φ⁺⟩⟨φ⁺|_{S_j R_j} ⊗ 𝟙_{R̄_j} . (D4)

FIG. D1. (a) The Hilbert-Schmidt Test. The probability to measure the all-zero state at the end of the circuit is given by Tr[ |Φ⁺⟩⟨Φ⁺| (U V† ⊗ 𝟙_R) |Φ⁺⟩⟨Φ⁺| (V U† ⊗ 𝟙_R) ] = |Tr[U V†]|² / 2^{2n}. (b) The Local Hilbert-Schmidt Test. The probability to measure zeros across qubits S_j and R_j is equal to Tr[ H^{(j)}_LHST (U V† ⊗ 𝟙_R) |Φ⁺⟩⟨Φ⁺| (V U† ⊗ 𝟙_R) ]. Note that in both circuits we have used the ricochet property (X† ⊗ 𝟙)|Φ⁺⟩ = (𝟙 ⊗ X*)|Φ⁺⟩ for any operator X. (Figure adapted from Ref. [28].)

Here S̄_j denotes the set of all qubits in S except for S_j, and similarly for R̄_j. The circuit used to measure C_LHST is shown in Fig. D1(b).
In Ref. [28] it was proven that C_LHST and C_HST are related as
C_LHST(U, V) ≤ C_HST(U, V) ≤ n C_LHST(U, V) . (D5)
Consequently, C_LHST inherits C_HST's desirable properties. Specifically, C_LHST vanishes iff C_HST vanishes, and hence C_LHST is faithful. Furthermore, C_LHST can be shown to bound the average gate fidelity.

Appendix E: Average and Variance of Gradient for Approximate Designs

In this section we discuss the average and variance of the gradient when the target ensemble is not an ideal but an approximate 2-design.
Let µ be a distribution of a unitary ensemble, and ∆_µ the 2-fold channel
∆_µ(ρ) = ∫_µ dU U^{⊗2} ρ (U†)^{⊗2} . (E1)
Here we use a strong definition for an approximate unitary 2-design [53]; namely, a unitary ensemble with distribution µ is an ε-approximate 2-design iff
(1 − ε) ∆_Haar ⪯ ∆_µ ⪯ (1 + ε) ∆_Haar , (E2)
where ⪯ is the semi-definite ordering, i.e., channel A ⪯ B iff B − A is completely positive.
The second-order integral we are interested in can be expressed as
∫_µ dU Tr[A U B U† D U E U†] = Tr[ (A ⊗ D) ∆_µ(B ⊗ E) W_p ] , (E3)
where W_p is the permutation (swap) operator on the tensor-product Hilbert space. Denote F_U = Tr[A U B U† D U E U†]. When A, B, D and E are all positive operators, as a result of the semi-definite ordering (E2),
(1 − ε) ∫_Haar dU F_U ≤ ∫_µ dU F_U ≤ (1 + ε) ∫_Haar dU F_U . (E4)
This relation carries over to non-positive operators by expanding them with positive operators as a basis set in the operator space.
By choosing D = E = 𝟙, the first-order integral involved in the average of the gradient is also covered by the above relation. Hence, for an ε-approximate 2-design, the average gradient is also zero, and the variance is equivalent to the case of an ideal 2-design up to a multiplicative factor 1 ± ε.