General framework for constructing fast and near-optimal machine-learning-based decoder of the topological stabilizer codes
Amarsanaa Davaasuren, Yasunari Suzuki, Keisuke Fujii, Masato Koashi
Department of Applied Physics, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
Photon Science Center, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
JST, PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan
Department of Physics, Graduate School of Science, Kyoto University, Kitashirakawa-Oiwakecho, Sakyo, Kyoto 606-8502, Japan
(Dated: February 19, 2019)

Quantum error correction is an essential technique for constructing a scalable quantum computer. In order to implement quantum error correction with near-term quantum devices, a fast and near-optimal decoding method is demanded. A decoder based on machine learning is considered one of the most viable solutions for this purpose, since its prediction is fast once training has been done, and it is applicable to any quantum error correcting code and any noise model. So far, various formulations of the decoding problem as a task of machine learning have been proposed. Here, we discuss general constructions of machine-learning-based decoders. We find several conditions needed to achieve near-optimal performance, and propose a criterion which should be optimized when the size of the training data set is limited. We also discuss preferable constructions of neural networks, and propose a decoder that exploits the spatial structure of topological codes using a convolutional neural network. We numerically show that our method can improve the performance of machine-learning-based decoders for various topological codes and noise models.
I. INTRODUCTION
In order to build a scalable quantum computer, quantum error correction (QEC) [1–3] is a vital technique for achieving reliable computation. According to the theory of QEC, if the noise strength is smaller than a certain threshold value, we can protect logical qubits encoded in physical qubits from the noise. Supported by extensive experimental efforts, the noise level of quantum operations on arrays of qubits is now approaching and, in some cases, meets the threshold value. Therefore, a demonstration of QEC in a fully fault-tolerant setting is considered to be a milestone for near-term quantum devices [4–6]. Topological codes [7–9] are a family of quantum error correcting codes inspired by topological phenomena in condensed matter physics [7]. Since topological codes such as surface codes [8, 10, 11] have both high experimental feasibility and high performance [12–15], they are considered the most promising candidates for quantum error correcting codes.

In QEC, information on the occurrence of physical errors is measured as a syndrome value. A suitable recovery operation is estimated from the syndrome so that the original state of the logical qubits is recovered with high success probability. Unfortunately, constructing an optimal decoder is computationally hard in general. Thus, massive efforts have been devoted to developing efficient and near-optimal decoders. One approach is to use the most likely physical error that is consistent with the observed syndrome value as a recovery operation. This scheme is called the minimum-distance (MD) decoder. Though this decoding method is not necessarily optimal, it shows almost optimal performance [13–15]. In the case of the surface codes, if we can assume that bit-flip (Pauli X) and phase-flip (Pauli Z) errors are uncorrelated, we can construct an efficient MD decoder using minimum-weight perfect matching. However, if bit-flip and phase-flip errors are correlated, or if we use other codes, even MD decoding is not efficiently implementable [16]. Some of these problems can be avoided by the use of geometrically local features of the topological codes. For example, for color codes [17], we can perform decoding by projecting the color code to a surface code [18]. Another approach is the renormalization group method [19], which is applicable to any topological code including the surface and color codes. While these approaches have been improved, there is an unavoidable trade-off between the performance and the time efficiency of the decoder. For the first experimental realization of QEC on near-term devices, more efficient and near-optimal decoders are demanded.

In this article, we discuss a general construction of machine-learning-based decoders. Recently, the technology of machine learning has been applied to various theoretical and experimental studies in quantum physics, such as classification of readout signals in experiments [20], simulation of a quantum system [21], classification of phases of matter [22], data compression of quantum states [23], and decoding in QEC [24–28]. In a machine-learning-based decoder, we construct a prediction model which outputs a recovery operator from a given syndrome value. The prediction model is trained with many pairs of syndrome values and correct recovery operations before prediction.
While the trainingtask may take a long time, it is required only once beforemany runs of prediction, and each prediction is expectedto be performed fast. Thus, the machine-learning-baseddecoder is one of the best solutions for demonstratingexperimental QEC in near-term quantum devices.As a prediction model, artificial neural network is be-lieved to have large representation power, and is suitablefor constructing machine-learning-based decoder. Re-cently, the performances of machine-learning-based de-coders with various neural networks have been numer-ically studied, such as restricted Boltzmann machine[24], multi-layer perceptron [25], recurrent neural net-work [26], and deep neural network [27]. The machine-learning-based decoder using a neural network is called neural decoders [24]. All these existing methods numeri-cally showed that the performance of the neural decoderis superior to the known efficient decoders when suffi-ciently large amount of the training data set is supplied.However, the following three points have yet to be under-stood. The first one is how the decoding problem shouldbe translated to the task of machine learning in order toobtain faster learning and better prediction. So far, eachof the previous studies introduces its own construction ofthe data set and neural network with little considerationon this point. Second, the spatial feature of the topolog-ical codes has not been considered in the construction ofthe neural decoder, except a very recent study [28] thatwas carried out independently of this work. While it isexpected that the performance of the neural decoder isimproved by explicitly considering the spatial arrange-ment of the syndrome, the spatial information has notbeen given to the neural network explicitly. Finally, theapplicability of the neural decoder to various topologicalcodes is not known. The neural decoder is benchmarkedonly with surface codes [24–28]. Therefore, it has notbeen known whether the neural decoder is applicable toother codes, such as color codes.We have addressed all of these points in this paper.First, we discuss how the decoding problem should beformulated as the task of machine learning. We pro-pose a general framework for constructing a neural de-coder, linear prediction framework , to elucidate the fac-tors that determine the performance of the decoders. Wepropose a criterion called normalized sensitivity whichshould be optimized for constructing a near-optimal neu-ral decoder. Then, we propose specific construction of atraining data set which minimizes the normalized sensi-tivity. We call these constructions as uniform data con-struction . We also propose the use of construction ofneural networks, which explicitly utilize spatial structureof the topological codes. We show that the performanceof the neural decoder is improved with these techniques,and it shows better performance than that of a decoderusing minimum-weight perfect matching with 10 dataset at distance d = 11 in the surface code under a depo- larizing noise. We show that the neural decoder is alsoapplicable to the color codes. The performance of theneural decoder for the color codes also reaches that ofthe MD decoder in small distances. Organization of the article
In Sec II, we overview preliminary topics. We reviewa scheme of QEC in the case of stabilizer codes. Weexplain specific constructions of the topological codes,the surface and color codes. We also review the basicsof the supervised machine learning with neural networksin this section. In Sec III, we address the question ofhow the neural decoder should be constructed. We pro-pose a general framework, linear prediction framework,in this section. We introduce a quantity called the nor-malized sensitivity, and argue that it serves as a criterionfor better performance of decoders for topological stabi-lizer codes. We also propose uniform data construction,which consists of specific instructions to optimize the nor-malized sensitivity for surface codes and color codes. Wenumerically confirm that the performance of the neuraldecoder is improved with this construction in the caseof the surface and color codes. In Sec IV, we propose anetwork construction which explicitly utilize the spatialinformation of the topological codes. We confirm thatthis construction also improves the performance of theneural decoder. Finally, we summarize this paper in SecV.
II. PRELIMINARY
In this section, we review the basic concepts and in-troduce notations used in this paper. We first review ascheme of QEC. We also introduce well-known topolog-ical codes and decoders. The scheme of supervised ma-chine learning with neural network and its terminologiesare also explained in this section.
A. Quantum error correction
We consider the case where k logical qubits are encoded in n physical qubits. We assume that any noise can be represented as a probabilistic Pauli operation on the n physical qubits. We denote the Pauli operators on a single qubit as {I, X, Y, Z}, and the Pauli operator A on the i-th physical qubit as A_i. When we consider operations on the n physical qubits, we ignore the global phase of the state and operator. Then, we can represent any physical error as E ∈ {I, X, Y, Z}^{⊗n}. The weight w(E) of a Pauli operator E on the n physical qubits is defined as the number of physical qubits on which E acts non-trivially.

In the framework of stabilizer codes [29], the code is defined by the 2^{n−k} stabilizer operators L_I generated by n − k Pauli operators, L_I := ⟨{S_i}⟩ (1 ≤ i ≤ n − k), where S_i ∈ ±{I, X, Y, Z}^{⊗n}, −I ∉ ⟨{S_i}⟩, and the generators commute with each other. The logical space of the code is defined as the subspace which has eigenvalue +1 for all the stabilizer operators, i.e., S_i |ψ⟩ = |ψ⟩ for all i. We denote the normalizer of the stabilizer operators as L. We call the elements of L\L_I logical operators. Each stabilizer operator acts on the logical space trivially, and each logical operator acts on the logical space non-trivially. The distance d of the code is defined as d := min_{L ∈ L\L_I} w(L). A code which encodes k logical qubits in n physical qubits with distance d is called an [[n, k, d]] code.

The occurrence of a physical error is detected from the outcomes of the stabilizer measurements s, where s^T ∈ {0,1}^{n−k} and the i-th element s_i is the measurement outcome of the i-th stabilizer operator S_i. We call s the syndrome vector. To recover the original state of the logical qubits, we estimate a recovery Pauli operator T̂(s) ∈ {I, X, Y, Z}^{⊗n} from the observed syndrome vector s so that the total operation including the physical error acts on the logical space trivially with high probability. The mapping from the syndrome s to the recovery operator T̂(s) is called a decoder T̂. The logical error probability p_L is defined as the probability with which the total operation becomes logically non-trivial. Our purpose is to construct an efficient decoder T̂ which minimizes the logical error probability p_L.

B. Binary representation of stabilizer code
It is convenient to translate the calculations in the stabilizer codes into binary calculations in GF(2). In GF(2), addition ⊕ is performed modulo 2. We relate the Pauli operators on the i-th physical qubit to another representation,
$I_i \mapsto \sigma^{(i)}_{00},\ X_i \mapsto \sigma^{(i)}_{10},\ Y_i \mapsto \sigma^{(i)}_{11},\ Z_i \mapsto \sigma^{(i)}_{01}$.  (1)
Then, a Pauli operator P on the n physical qubits can be described as
$P = \alpha \bigotimes_{i=1}^{n} \sigma^{(i)}_{v_i v_{n+i}}$,  (2)
where $\alpha \in \{\pm 1, \pm i\}$ and $v_i \in \{0,1\}$ ($1 \le i \le 2n$). We define a binary mapping
$b(P) := v$,  (3)
where $v := (v_1, v_2, \ldots, v_{2n-1}, v_{2n}) \in \{0,1\}^{2n}$ is a row vector, for the Pauli operator $P = \alpha \bigotimes_{i=1}^{n} \sigma^{(i)}_{v_i v_{n+i}}$. For two arbitrary Pauli operators P and P′, b(P) = b(P′) means that the two Pauli operators are equivalent up to a global phase. The product of two Pauli operators P and P′ is represented by the sum b(PP′) = b(P) ⊕ b(P′). With the 2n × 2n matrix
$\Lambda = \begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}$,
the commutation relation of two Pauli operators P and P′ is given by $b(P)\,\Lambda\, b(P')^T$, which is 0 if P and P′ commute, and 1 if they anti-commute. We denote this commutation relation in terms of the binary representations $v, v' \in \{0,1\}^{2n}$ as $c(v, v') := v \Lambda v'^T$. The weight of the binary representation of a Pauli operator, w(v), is defined so that w(b(P)) = w(P), which is equivalent to defining the weight as the number of indices i (1 ≤ i ≤ n) such that
$v_i \oplus v_{i+n} \oplus v_i v_{i+n} = 1$.  (4)
We use h(v) for the Hamming weight of v as a binary string, namely, the number of indices i (1 ≤ i ≤ 2n) such that v_i = 1. We denote the i-th row vector of a matrix M as (M)_i. The length of a vector v is represented as |v|. With this definition, the normalizer of the stabilizer operators L is characterized by
$b(\mathcal{L}) = \{\, v \mid v \in \{0,1\}^{2n},\ c(v, v') = 0\ \ \forall v' \in b(\mathcal{L}_I) \,\}$,  (5)
since the normalizer of the stabilizer operators is equivalent to the centralizer in the current formalism. Note that the stabilizer group can be defined with the normalizer L as
$b(\mathcal{L}_I) = \{\, v \mid v \in \{0,1\}^{2n},\ c(v, v') = 0\ \ \forall v' \in b(\mathcal{L}) \,\}$.  (6)
With this formalism, QEC is translated as follows. The physical error E can be represented as a row binary vector $e := b(E) \in \{0,1\}^{2n}$, which occurs with a certain probability p_e. The syndrome vector s is given by the column vector $s(e) := H_c \Lambda e^T$, where H_c is an (n−k) × 2n matrix of which the i-th row vector (H_c)_i is b(S_i). The matrix H_c is called the check matrix. In the binary representation, we denote a decoder as r, which maps a given syndrome vector $s^T \in \{0,1\}^{n-k}$ to a binary representation of a recovery operator $r(s) \in \{0,1\}^{2n}$. It is convenient to define the pure error t(s) [30] to represent various vectors succinctly. The pure error is a function which maps a syndrome vector $s^T \in \{0,1\}^{n-k}$ to a vector $t(s) \in \{0,1\}^{2n}$, and satisfies $t(s(e)) \oplus e \in b(\mathcal{L})$ for an arbitrary $e \in \{0,1\}^{2n}$. We also introduce a 2k × 2n generator matrix G such that the elements of b(L) are uniquely represented as follows:
$b(\mathcal{L}) = \{\, l \oplus w G \mid l \in b(\mathcal{L}_I),\ w \in \{0,1\}^{2k} \,\}$.  (7)
Note that the generator matrix G satisfies $H_c \Lambda G^T = 0$. We define the cosets $L_w$ with $w \in \{0,1\}^{2k}$ as
$L_w = \{\, l \oplus w G \mid l \in L_0 \,\}$.  (8)
Note that $L_0 = b(\mathcal{L}_I)$.
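To make the binary formalism concrete, the following sketch (our own illustration, not code from the paper) encodes Pauli operators as length-2n binary vectors and evaluates syndromes and commutation relations with NumPy. The three-qubit bit-flip code used at the end, and all function names, are assumptions chosen only for brevity.

```python
import numpy as np

def symplectic_form(n):
    """The 2n x 2n matrix Lambda used in c(v, v') = v Lambda v'^T (mod 2)."""
    zero, eye = np.zeros((n, n), dtype=np.uint8), np.eye(n, dtype=np.uint8)
    return np.block([[zero, eye], [eye, zero]])

def pauli_to_binary(pauli):
    """b(P): e.g. 'XIZ' -> (x-part | z-part) as a length-2n binary row vector, as in Eq. (1)."""
    n = len(pauli)
    v = np.zeros(2 * n, dtype=np.uint8)
    for i, op in enumerate(pauli):
        if op in "XY":
            v[i] = 1          # X component
        if op in "ZY":
            v[n + i] = 1      # Z component
    return v

def commute(v, vp, Lam):
    """c(v, v'): 0 if the operators commute, 1 if they anti-commute."""
    return int((v @ Lam @ vp) % 2)

def syndrome(H_c, Lam, e):
    """s(e) = H_c Lambda e^T over GF(2)."""
    return (H_c @ Lam @ e) % 2

# Toy example (our choice): three-qubit bit-flip code with stabilizers Z1Z2 and Z2Z3.
n = 3
Lam = symplectic_form(n)
H_c = np.array([pauli_to_binary("ZZI"), pauli_to_binary("IZZ")])  # check matrix, (n-k) x 2n
e = pauli_to_binary("XII")                                        # bit flip on the first qubit
print(syndrome(H_c, Lam, e))   # -> [1 0]: the error anti-commutes with Z1Z2 only
```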
Given t(s) and G, an arbitrary physical error $e \in \{0,1\}^{2n}$ is uniquely decomposed as
$e = l(e) \oplus w(e)\, G \oplus t(s(e))$  (9)
with $l(e) \in L_0$ and $w(e) \in \{0,1\}^{2k}$. We call w(e) the class of e.

A decoder with a recovery operation r(s) can correct an error e if and only if $e \oplus r(s(e)) \in L_0$. Under an error model {p_e}, the logical error probability is given by
$p_L = \Pr_{e \sim \{p_e\}}\big[\, e \oplus r(s(e)) \notin L_0 \,\big] = \Pr_{e \sim \{p_e\}}\big[\, r(s(e)) \oplus w(e) G \oplus t(s(e)) \notin L_0 \,\big]$.  (10)

C. Optimal and near-optimal decoders
An optimal decoder is defined as a decoder which minimizes the logical error probability. Let us write the conditional probability of $w(e) \in \{0,1\}^{2k}$ for a given syndrome vector s as
$q_s(w) := \Pr_{e \sim \{p_e\}}\big[\, w(e) = w \mid s(e) = s \,\big]$.  (11)
Since the decoder is only provided with s, and distinct recovery operators are needed for correcting errors with different values of w(e), the maximum probability of successful correction given s is $\max_{w \in \{0,1\}^{2k}} q_s(w)$. We thus say a decoder is optimal if r(s) satisfies
$\Pr_{e \sim \{p_e\}}\big[\, e \oplus r(s) \in L_0 \mid s(e) = s \,\big] = \max_{w \in \{0,1\}^{2k}} q_s(w)$  (12)
for any s with
$\Pr_{e \sim \{p_e\}}\big[\, s(e) = s \,\big] > 0$.  (13)
Though the definition of w(e) depends on the choice of t(s) and G, the optimality of a decoder r(s) is independent of this choice.

Another important class of near-optimal decoders is the minimum-distance (MD) decoder. An MD decoder chooses the most probable physical error e*(s), which satisfies
$p_{e^*(s)} \ge p_e \quad \forall e \in \{\, e \mid s(e) = s \,\}$,  (14)
as a recovery operation. Though the most likely physical error e*(s) does not necessarily satisfy the condition of Eq. (12), it is empirically known that the MD decoder achieves near-optimal performance.

It is known that the MD decoder can be constructed efficiently only in limited cases of the code and the error model. For example, we can construct an efficient MD decoder for the surface code under independent bit-flip and phase-flip errors. In this case, we can reduce the decoding problem to minimum-weight perfect matching (MWPM), which can be solved efficiently with the blossom algorithm [31]. When bit-flip and phase-flip errors are correlated, we can still construct a decoder with MWPM by ignoring the correlation, resulting in a sub-optimal decoder. We call such a decoder an MWPM decoder.
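On a toy scale, MD decoding can be done by exhaustive search: enumerate Pauli errors in order of increasing weight and return the first one consistent with the observed syndrome; for i.i.d. single-qubit noise such as depolarizing noise, a minimum-weight consistent error is also a most probable one. The sketch below is our own illustration (function names and the toy code are ours, not the authors'), and its exponential cost restricts it to very small codes.

```python
import itertools
import numpy as np

def syndrome(H_c, Lam, e):
    """s(e) = H_c Lambda e^T over GF(2); e is a length-2n binary row vector."""
    return (H_c @ Lam @ e) % 2

def brute_force_md_decoder(H_c, Lam, s, max_weight=4):
    """Return a minimum-weight Pauli error (in binary representation) consistent with s."""
    n = H_c.shape[1] // 2
    for weight in range(max_weight + 1):
        # place `weight` single-qubit Paulis (X, Z, or Y) on distinct qubits
        for qubits in itertools.combinations(range(n), weight):
            for paulis in itertools.product([(1, 0), (0, 1), (1, 1)], repeat=weight):
                e = np.zeros(2 * n, dtype=np.uint8)
                for q, (x_bit, z_bit) in zip(qubits, paulis):
                    e[q], e[n + q] = x_bit, z_bit
                if np.array_equal(syndrome(H_c, Lam, e), s):
                    return e
    return None  # no error of weight <= max_weight reproduces s

# Toy usage: three-qubit bit-flip code (stabilizers Z1Z2, Z2Z3) and the syndrome of an X error.
n = 3
Lam = np.block([[np.zeros((n, n), dtype=int), np.eye(n, dtype=int)],
                [np.eye(n, dtype=int), np.zeros((n, n), dtype=int)]])
H_c = np.array([[0, 0, 0, 1, 1, 0],
                [0, 0, 0, 0, 1, 1]])
s = np.array([1, 0])
print(brute_force_md_decoder(H_c, Lam, s))   # -> [1 0 0 0 0 0], i.e. X on the first qubit
```

D. Topological code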
We consider two types of the topological codes in thisarticle: surface codes and color codes. The qubit al-location of the surface code is shown in Fig. 1. The[[2 d − d +1 , , d ]] code and the [[ d , , d ]] code are shownin Fig. 1(a) and (b), respectively. In both figures, thephysical qubits are located on the vertices of the col-ored faces. Each red face represents a stabilizer operatorwhich is a product of Pauli X operators on the physical qubits of its vertices. Each blue face represents one withPauli Z operators.The color codes consist of the lattice which has 3-colored faces: red, green, and blue. Two types of codes,the [4,8,8]-color code and the [6,6,6]-color code, are shownin Fig. 2(a) and (b), respectively. The physical qubits arealso located on each vertex of the faces. Each coloredface represents a stabilizer operator, including nontrivialPauli operators for its vertices. The [4 , , d + d − , , d ]] code, and the [6 , , d + , , d ]] code. E. Supervised machine learning
Supervised machine learning is a branch of artificial intelligence that requires a training data set {(x_1, y_1), ..., (x_N, y_N)} consisting of feature data x_i and the corresponding label data y_i. Its aim is to prepare a model that takes feature data as input and outputs an inferred label for it. The model has a predetermined structure and trainable parameters θ.

Unlike a simple dictionary, the model is expected to infer a label even for unseen feature data. This is achieved by optimizing the model parameters θ for the training data set. This process is commonly called training. Specifically, during training, the difference between the output y′ of the model for a feature and the correct label y is evaluated with a real-valued loss function L(y, y′). The loss is minimized if and only if the prediction is exactly the same as the correct label. The training data is used to optimize the model parameters θ so as to reduce the loss. This can be done with standard optimization methods such as stochastic gradient descent:
$\theta \leftarrow \theta - \gamma \nabla_\theta L$,  (15)
where $\gamma \in \mathbb{R}$ is a learning rate and L is calculated for a randomly chosen subset, called a batch, of the training data set. As we can see here, the loss function is required to be differentiable, such as the L2 distance ||y − y′||. Once trained, we can apply the model to unseen feature data and obtain a predicted label through a simple calculation involving the network parameters and the input feature data.

An artificial neural network (ANN) is a machine learning model inspired by neural structures found in nature. Here, we assume that neurons are real-valued functions and that a layer h is a vector of neurons. The multilayer perceptron (MLP) is one of the simplest ANNs which, as its name suggests, consists of multiple layers of neurons including the input and output layers. Each neuron in a layer is connected to all neurons in the neighboring layers with trainable weights and biases, and is completely independent of the other neurons in its own layer. Mathematically, this can be described as
$h^{(n)}_i = A\Big( \sum_j W^{(n,n-1)}_{ij} h^{(n-1)}_j + b^{(n)}_i \Big)$,  (16)
where A is a nonlinear activation function, $h^{(n)}_i$ is the i-th neuron in the n-th layer, $b^{(n)}_i$ is the bias added to the i-th neuron in the n-th layer, and $W^{(n,n-1)}_{ij}$ is the weight connecting the i-th neuron in the n-th layer to the j-th neuron in the (n−1)-th layer. The gradient $\nabla_\theta L$ is evaluated with the back-propagation method. According to the universal approximation theorem [32], any continuous function can be approximated by an MLP model of finite size, though its structure is simple and compact. Thus, we expect that a neural decoder with an MLP model can achieve near-optimal performance under an appropriate training process.

FIG. 1: The qubit allocation of the surface codes with (a) the [[2d² − 2d + 1, 1, d]] code and (b) the [[d², 1, d]] code. Each vertex corresponds to a physical qubit. Red and blue faces correspond to stabilizer measurements with X and Z Pauli operators, respectively.

FIG. 2: The qubit allocation of the [4,8,8]-color code and the [6,6,6]-color code. Each vertex corresponds to a physical qubit, and each face corresponds to a stabilizer operator.
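To make the training loop of Eqs. (15)–(16) concrete, the following sketch builds a small MLP that maps a syndrome vector to a real-valued diagnosis vector and trains it with the squared L2 loss used later in Sec. III. This is an illustration we add here, not the authors' implementation; the layer sizes, learning rate, and the choice of PyTorch are our own assumptions (the paper's hyper-parameters are chosen by grid search, see its Appendix B).

```python
import torch
from torch import nn

# Hypothetical sizes: n_syndrome syndrome bits in, L_g diagnosis bits out.
n_syndrome, L_g, n_hidden = 24, 15, 256

# A plain MLP as in Eq. (16): affine layers followed by nonlinear activations.
model = nn.Sequential(
    nn.Linear(n_syndrome, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, L_g), nn.Sigmoid(),   # outputs in [0, 1]
)
loss_fn = nn.MSELoss()                        # squared L2 distance
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(syndromes, diagnoses):
    """One SGD update (Eq. (15)) on a batch of (s, g) training pairs."""
    optimizer.zero_grad()
    loss = loss_fn(model(syndromes), diagnoses)
    loss.backward()                           # back-propagation of the gradient
    optimizer.step()
    return loss.item()

# Usage with random placeholder data (real data comes from the error model of Sec. III B):
s_batch = torch.randint(0, 2, (128, n_syndrome)).float()
g_batch = torch.randint(0, 2, (128, L_g)).float()
print(train_step(s_batch, g_batch))
```

III. CONSTRUCTION OF TASKS OF MACHINE-LEARNING-BASED DECODERS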
In general, the achievable accuracy in machine learning with a given size of training data depends on the formulation of the prediction task. In order to construct a near-optimal neural decoder, it is vital to consider what is a preferable formulation of the prediction task. However, this point has not been discussed in a unified view in the existing methods [24–27]. In this section, we discuss how the decoding problem should be formulated as a task of machine learning in order to achieve near-optimal performance. To this end, we propose a general framework, which we call the linear prediction framework. In this framework, we can analytically study the behavior of the neural decoder, and can discuss requirements for achieving near-optimal performance. Based on the discussion, we propose a criterion, the normalized sensitivity, which should be optimized in defining the label for constructing a good decoder. We show specific constructions which minimize the normalized sensitivity for the surface codes and the color codes, which we call the uniform data construction. Then, we numerically confirm that the performance of the neural decoder is improved with this construction. We also confirm that this construction is applicable to the color codes.

FIG. 3: Feed-forward network. The j-th neuron in the (n−1)-th layer is connected to the i-th neuron in the n-th layer via the weight W_ij.

A. Linear prediction framework
In order to discuss the behavior of the neural decoderin a unified view, we consider a neural decoder with thefollowing two specifications. First, the neural decoderuses the syndrome vector s as the feature data to befed to the trainable model. Second, the label data is abinary vector, and the correct label is linearly generatedfrom the physical error vector e in GF(2). We call alinearly generated label vector g as a diagnosis , and amatrix H g which generates the diagnosis g := H g Λ e T asa diagnosis matrix. We denote the length of the diagnosisvector g as L g . The recovery operator r is calculatedfrom the predicted diagnosis g and the syndrome s . Weuse an assumed physical error distribution { p e } only forgenerating a training data set { ( s i , g i ) } , and do not use it for constructing H g or in the calculation of the recoveryoperator r from g and s . Though this framework restrictsthe label to be linearly generated from the physical error,this is general enough to formulate all the constructionsdescribed in the existing methods as special cases [24–28]with small technical exceptions.Since the actual performance of the neural decoder de-pends on many factors such as configurations of the train-ing process, the size of the training data set, and detailsof the network construction, we start with consideringthe problem under an ideal limit. We first consider theproblem under the simple 0-1 loss function with an unlim-ited size of the training data set. Then, we relax theseimpractical assumptions to practical ones. Though wenumerically investigate the case of a single logical qubit( k = 1) later, we present the formalism for a generalvalue of k .
1. The neural decoder with the 0-1 loss function and anunlimited training data set
We first consider a hypothetical decoder that can min-imize any loss function with an unlimited number of thetraining data set. Though such an assumption is notpractical, it is convenient to reveal the conditions for per-forming optimal decoding with machine learning in theideal limit. We choose the 0-1 delta function δ ( g , g (cid:48) ) asthe loss function, which is zero if the predicted and thecorrect diagnosis are the same, and unity otherwise. Letus consider the portion of training data set with a specificvalue of s with Pr e ∼{ p e } [ s ( e ) = s ] >
0. If the neural de-coder returns diagnosis g for the input s , the total lossfor this portion is proportional to the following value, L ( δ ) s ( g ) := E e ∼{ p e } (cid:2) δ ( g , H g Λ e T ) (cid:12)(cid:12) s ( e ) = s (cid:3) = 1 − Pr e ∼{ p e } (cid:2) H g Λ e T = g (cid:12)(cid:12) s ( e ) = s (cid:3) . (17)Let g ( δ ) ( s ) be the output of the ideally trained neuraldecoder. Since it should minimize the total loss for every s , it satisfies L ( δ ) s ( g ( δ ) ( s )) = min g L ( δ ) s ( g ) . (18)We call this ideal decoder a delta diagnosis decoder and g ( δ ) ( s ) a delta diagnosis vector .We show the condition for a diagnosis matrix H g toguarantee that we can perform the optimal decoding withthe delta diagnosis decoder. To this end, we define aproperty of the diagnosis matrix and introduce a set ofdiagnosis vectors as follows. Definition III.1. faithful diagnosis matrix — Given acheck matrix H c , we say diagnosis matrix H g is faithful if span( { ( H cg ) i } ) = b ( L ) , (19)or equivalently, H cg Λ e T = 0 ↔ e ∈ L , (20)where H cg := (cid:18) H c H g (cid:19) . (21) Definition III.2. faithful diagnosis vectors — Given acheck matrix H c , a pure error t ( s ), and a faithful diag-nosis matrix H g , we define 2 k faithful diagnosis vectors { g s ( w ) } ( w ∈ { , } k ) associated with a syndrome vec-tor s by g s ( w ) := H g Λ( w G ⊕ t ( s )) T . (22)Note that the faithful condition of H g implies that w (cid:55)→ g s ( w ) (23)is injective and H g Λ e T = g s ( w ( e )) , (24)with s = H c Λ e T . As a result, when H g is faithful, wehave1 − L ( δ ) s ( g ) = Pr e ∼{ p e } [ g s ( w ( e )) = g | s ( e ) = s ] (25)from Eqs. (17) and (24). Then the injective property of g s ( w ) leads to 1 − L ( δ ) s ( g s ( w )) = q s ( w ) , (26)where q s ( w ) is defined in Eq. (11).When the diagnosis matrix is faithful, we can constructan optimal decoder as follows. From Eqs. (18) and (26),we see that the delta diagnosis vector g ( δ ) ( s ) is one ofthe faithful diagnosis vectors. We can thus write it inthe form g ( δ ) ( s ) = g s ( w ∗ ( s )) . (27)Eqs. (18), (26), and (27) imply that1 − q s ( w ∗ ( s )) = L ( δ ) s ( g ( δ ) ( s ))= min w ∈{ , } k (1 − q s ( w ))= 1 − max w ∈{ , } k q s ( w ) . (28)Since g s ( w ) is injective, one can calculate w ∗ ( s ) from thediagnosis g ( δ ) ( s ) and syndrome s . The recovery operatoris then chosen as r ( s ) = w ∗ ( s ) G ⊕ t ( s ) . (29)For the optimality, we havePr e ∼{ p e } [ e ⊕ r ( s ) ∈ L | s ( e ) = s ] = q s ( w ∗ ( s ))= max q s ( w ) (30)for any s with Pr e ∼{ p e } [ s ( e ) = s ] >
0, which satisfiesEq. (12).We can also prove a converse statement for the caseswhere H g is not faithful (see Appendix A), arriving atthe following lemma. Lemma III.1.
If the diagnosis matrix H g is faithful,there exists a map r ∗ ( g , s ) such that the decoder with r ( s ) = r ∗ ( g ( δ ) ( s ) , s ) is optimal for arbitrary distribution { p e } . If the diagnosis matrix H g is not faithful, no suchmap exists.This lemma implies that we can perform optimal de-coding with the delta diagnosis decoder only when thediagnosis matrix H g is faithful. Note that the set of thefaithful vectors { g s ( w ) | w ∈ { , } k } is independent ofthe choice of the generator G and the pure error t ( s ).Whether we can perform the optimal decoding or not isdependent only on the construction of H g .
2. The neural decoder with the L2 loss function and anunlimited training data set
In this subsection, we replace the 0-1 loss function witha more practical one, which is the squared L2 distance.We still consider the limit of an infinite size of the trainingdata set and the perfect loss minimization. In this case,the total loss for a fixed s under an unlimited trainingdata set is proportional to the following value. L (L2) s ( g ) = E e ∼{ p e } (cid:2) || g − H g Λ e T || (cid:12)(cid:12) s ( e ) = s (cid:3) (31)We define a decoder which is ideally trained with the L2loss function as an L2 diagnosis decoder . We also callthe output of the L2 diagnosis decoder as an
L2 diagno-sis vector g (L2) ( s ). The L2 diagnosis vector satisfies thefollowing equation. L ( L s ( g (L2) ( s )) = min g ∈{ , } Lg L (L2) s ( g ) . (32)When the chosen diagnosis matrix is faithful, we can an-alytically solve g (L2) ( s ) by differentiating Eq. (31), andthe L2 diagnosis vector can be written as follows. g (L2) ( s ) := (cid:88) w ∈{ , } k q s ( w ) g s ( w ) (33)Let us define a column vector of order 2 k as q s := ( q s (0 k ) , . . . q s (1 k )) T . (34)It satisfies the following matrix equation: (cid:18) ˆ g (L2) ( s )1 (cid:19) = D s q s , (35)where D s = (cid:18) g s (0 k ) · · · g s (1 k )1 · · · (cid:19) . (36)We can solve it for q s if D s has a left inverse D − s suchthat D − s D s = I in the real-valued calculation, namely, ifthe rank of D s as a real-valued matrix is 2 k . If the rankis smaller, solution q s is not unique, and hence it is notalways possible to determine w that maximizes q s ( w ),which implies we cannot perform the optimal decoding.Though the rank condition depends apparently on thesyndrome s , we can formulate it as a condition which isindependent of s . Any faithful diagnosis g s ( w ) can bewritten as g s ( w ) = H g Λ( w G ) T ⊕ δ ( s ) (37)with δ ( s ) := H g Λ t ( s ) T ∈ (cid:0) { , } L g (cid:1) T . (38) We define a transformation σ δ by( σ δ ( v )) i := δ i + ( − δ i v i (39)for δ ∈ { , } k and v ∈ R k . It is affine, isometric, andinvolutory. Since g s ( w ) = σ δ ( s ) ( H g Λ( w G ) T ), we have D s = (cid:18) σ δ ( s ) ( H g Λ((0 k ) G ) T ) · · · σ δ ( s ) ( H g Λ((1 k ) G ) T )1 · · · (cid:19) . (40)We see that a transformation σ δ is an affine transforma-tion, and this transformation satisfies σ δ ( σ δ ( v )) = v (41) σ ( v ) = v . (42)Thus, when we apply the transformation σ δ ( s ) to Eq. (35), we obtain (cid:18) σ δ ( g (L2) ( s ))1 (cid:19) = D q s , (43)where D := (cid:18) H g Λ((0 , . . . , G ) T · · · H g Λ((1 , . . . , G ) T · · · (cid:19) . (44)Thus, we can uniquely calculate q s for an arbitrary s if a matrix D has a left inverse, which is equivalent tothe condition that { H g Λ( w G ) T | w ∈ { , } k } is affinelyindependent. We will call a diagnosis matrix satisfyingthis condition to be decomposable: Definition III.3. decomposable diagnosis matrix —Given a generator matrix G , we say a diagnosismatrix H g is decomposable if a set of real vectors { H g Λ( w G ) T | w ∈ { , } k } is affinely independent,namely, the rank of a matrix D defined in Eq. (44) is2 k when we consider D as a real-valued matrix.When H g is faithful, the above definition is indepen-dent of G , because the set { H g Λ( w G ) T | w ∈ { , } k } isindependent of G then.We show a scheme to perform the optimal decodingusing L2 diagnosis decoder when a diagnosis matrix isfaithful and decomposable. When H g is decomposable,there exists a left inverse D − such that D − D = I inreal vector space. When we observe a syndrome vector s ,we obtain the L2 diagnosis g (L2) ( s ) using the trained L2diagnosis decoder, and calculate δ ( s ) = H g Λ t ( s ). Sincethe diagnosis matrix is faithful, the probabilities of thefaithful diagnosis vectors are given by q s = D − (cid:18) σ δ ( s ) ( g (L2) ( s ))1 (cid:19) . (45) Then, we construct a recovery operator as r ( s ) = w ∗ ( s ) G ⊕ t ( s ) , (46)where w ∗ ( s ) satisfies q s ( w ∗ ( s )) = max w q s ( w ) . (47)With this recovery operator, we obtainPr e ∼{ p e } [ e ⊕ r ( s ) ∈ L | s ( e ) = s ] = q s ( w ∗ ( s )) , (48)and thus this decoder satisfies Eq. 
(12).When the diagnosis matrix H g is faithful, we can alsoprove a converse statement for the cases where a faithfuldiagnosis matrix H g is not decomposable (see AppendixA), arriving at the following lemma. Lemma III.2.
If the diagnosis matrix H g is faithful anddecomposable, there exists a map r ∗ ( g , s ) such that thedecoder with r ( s ) = r ∗ ( g (L2) ( s ) , s ) is optimal for arbi-trary distribution { p e } . If the diagnosis matrix H g isfaithful but not decomposable, no such map exists.We show a simple example of a faithful and decompos-able matrix H g in the case of k = 1. We choose vectors l , l , and l from L , L , and L , respectively. Weconstruct H g and generator G as H g = l l l , (49) G = (cid:18) l l (cid:19) . (50)We see that span( { ( H g ) i } ) = b ( L ), and thus H g is faith-ful. A set { H g Λ( w G ) T | w ∈ { , , , }} is { (0 , , T , (0 , , T , (1 , , T , (1 , , T } , (51)which is affinely independent, and thus H g is decompos-able. We can verify the same by checking the rank of D = (cid:18) g (00) g (01) g (10) g (11)1 1 1 1 (cid:19) = (52)to be 4 in real vector space. We can show that therealways exists such a faithful and decomposable diagnosismatrix for all k and H c . See Appendix A for the proof.
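As a sanity check on a candidate construction, the conditions of Definitions III.1 and III.3 can be tested numerically. The sketch below is our own illustration (helper names are ours): it checks faithfulness by comparing GF(2) row spaces of [H_c; H_g] and [H_c; G], and decomposability by the real rank of the matrix D of Eq. (44). It assumes H_c of shape (n−k) × 2n, H_g of shape L_g × 2n, G of shape 2k × 2n, and Lam the symplectic form Λ.

```python
import numpy as np
from itertools import product

def gf2_rank(M):
    """Rank of a binary matrix over GF(2) by Gaussian elimination."""
    M = (M % 2).astype(np.uint8).copy()
    rank, rows, cols = 0, M.shape[0], M.shape[1]
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, c]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]
        rank += 1
    return rank

def is_faithful(H_c, H_g, G):
    """Rows of [H_c; H_g] span b(L) iff they span the same GF(2) row space as [H_c; G]."""
    A, B = np.vstack([H_c, H_g]), np.vstack([H_c, G])
    return gf2_rank(A) == gf2_rank(B) == gf2_rank(np.vstack([A, B]))

def is_decomposable(H_g, G, Lam, k):
    """Real rank of the matrix D of Eq. (44)/(52) must be 2^(2k)."""
    cols = []
    for w in product([0, 1], repeat=2 * k):
        wG = (np.array(w) @ G) % 2           # binary representation of the class-w logical
        g_w = (H_g @ Lam @ wG) % 2           # faithful diagnosis H_g Lambda (wG)^T
        cols.append(np.append(g_w, 1))
    D = np.array(cols, dtype=float).T
    return np.linalg.matrix_rank(D) == 2 ** (2 * k)
```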
3. The neural decoder with the L2 loss function under afinite training data size
In practical cases, the size of the training data set islimited, and hence the loss is not perfectly minimized.This implies that the output diagnosis from the modeldeviates from the L2 diagnosis vector. In such a case, itis desirable to construct a decoder such that its predic-tion is as robust against the deviations as possible. Weintroduce a slight modification to the optimal decodingscheme in the last subsection, so that it should applica-ble to an output diagnosis deviated from the L2 diagnosisvector.We denote the predicted diagnosis as g P ( s ) ∈ R L g ,which deviates from the L2 diagnosis vector. Note that g P ( s ) cannot be represented as a linear combination ofthe faithful diagnosis vectors in general. In order toconstruct a decoding scheme which is robust to a smalldeviation, it is natural to extend the scheme employedin Sec. III A 2 such that we project g P ( s ) to the hyper-plane formed by affine combinations of the faithful diag-nosis vectors, and then extract the coefficients q P s fromthe projected point. This projection and extraction isachieved as follows. We perform QR decomposition for D , and obtain D = QR , where Q is an orthogonal ma-trix, and R is an upper-triangular matrix. We construct D − = R − Q T , which satisfies D − D = I . Then, weobtain a predicted vector q P s as q P s = D − (cid:18) σ δ ( s ) ( g P ( s ))1 (cid:19) , (53)where δ ( s ) = H g Λ t ( s ). We construct a recovery opera-tor as r ( s ) = w ∗ ( s ) G ⊕ t ( s ) , (54) where w ∗ ( s ) satisfies q P s ( w ∗ ( s )) = max w q P s ( w ) . (55)Note that though elements of q P s may be out of [0 ,
4. Criterion for diagnosis matrix
In practice, the number of the training data set is farsmaller than the total variation of syndrome vectors s when distance d is larger than about 7. For example,according to the existing methods [24–27], the size of thetraining data set is at most 10 . On the other hand, thenumber of variations in the syndrome, 2 n − k , becomeslarger than 10 at the distance d = 7 for the [[ d , , d ]]surface code. This implies that almost all the patterns ofthe syndrome vector s given in experiments are not foundin the training data set. The model should infer the L2diagnosis vector g (L2) ( s ) of s where s is not included inthe training data set. The aim of this subsection is topropose a criterion for H g which we believe to reflect therobustness of the prediction when we use such a sparselysampled training data set.Since the problem is to estimate the vector-valuedfunction g (L2) ( s ) from a sparsely sampled set of values,its difficulty should depend on how rapidly the functionchanges its output value as the input value s varies. FromEqs. (24) and (33), we see that the function is written as g (L2) ( s ) = E e ∼{ p e } [ H g Λ e T | s ( e ) = s ] , (56)which shows that g (L2) ( s ) is implicitly determined fromthe two functions of errors, g ( e ) = H g Λ e T and s ( e ) = H c Λ e T . In order to quantify how rapidly these functionchange, let us introduce a sensitivity m ( H ) of a binarymatrix H as m ( H ) := max e , e (cid:48) ∈{ , } n h ( e ⊕ e (cid:48) )=1 || H Λ e T − H Λ e (cid:48) T || = max e ∈{ , } n h ( e )=1 h ( H Λ e T ) . (57)Using the sensitivity, the variation of s ( e ) is boundedas || s ( e ) − s ( e (cid:48) ) || ≤ m ( H c ) h ( e ⊕ e (cid:48) ) . (58)In the case of topological codes, m ( H c ) is a small con-stant. This is because each physical qubit is monitoredby at most constant number of the stabilizer operators.Suppose that g (L2) ( s ) is close to one of the faithfuldiagnosis g s ( w ∗ ), and let S ( s , w ∗ ; 0) be the set of errors e satisfying w ( e ) = w ∗ and s ( e ) = s . We further definea set0 S ( s , w ∗ ; h ) := { e |∃ e (cid:48) s.t. e (cid:48) ∈ S ( s , w ∗ ; 0) , h ( e ⊕ e (cid:48) ) ≤ h } (59)We see that any e ∈ S ( s , w ∗ ; h ) produces a training data( s (cid:48) , g (cid:48) ) such that || s (cid:48) − s || ≤ m ( H c ) h (60) || g (cid:48) − g s ( w ∗ ) || ≤ m ( H g ) h. (61)The choice of H g also affects how precisely g (L2) ( s )should be estimated in order to determine w ∗ correctly. To quantify this, we consider how far g P ( s ) can be devi-ated from a faithful diagnosis g s ( w ) without affecting thedecoding method of Eqs. (53) and (54). When the decod-ing result changes from w ∗ = w to w ∗ = w (cid:48) , the solutionof Eq. (53) should satisfy q P s ( w ) = q P s ( w (cid:48) ), namely, g P ( s )should be written in the form g P ( s ) = α ( g s ( w ) + g s ( w (cid:48) )) + (cid:88) w (cid:48)(cid:48) (cid:54) = w , w (cid:48) β w (cid:48)(cid:48) g s ( w (cid:48)(cid:48) ) . (62)We define the minimum boundary distance M ( H g ) so as to assure that w ∗ = w as long as || g P ( s ) − g s ( w ) || ≤ M ( H g ). Hence M ( H g ) can be explicitly defined as M ( H g ) := min w , w (cid:48) ,α, { β w (cid:48)(cid:48) } || (1 − α ) g ( w ) − α g ( w (cid:48) ) − (cid:88) w (cid:48)(cid:48) (cid:54) = w , w (cid:48) β w (cid:48)(cid:48) g ( w (cid:48)(cid:48) ) || . (63)Note that the above definition is independent of s , sincethe affine transformation σ δ ( s ) is isometric. 
M(H_g) is nonzero if and only if H_g is decomposable. Regarding M(H_g) as the relevant length scale, we define the following quantity to be used as a criterion for a better construction of H_g.

Definition III.4. Normalized sensitivity — We define the normalized sensitivity N(H_g) of a faithful and decomposable matrix H_g as
$N(H_g) := \frac{m(H_g)}{M(H_g)}$,  (64)
where m(H_g) is the sensitivity of H_g defined in Eq. (57), and M(H_g) is the minimum boundary distance of H_g defined in Eq. (63).

Eqs. (61) and (63) imply that an error belonging to S(s, w*; h) with $h \lesssim (m(H_g)/M(H_g))^{-1}$ leads to a training datum useful for the estimation of g^{(L2)}(s). We thus expect that the use of a diagnosis matrix H_g with a small normalized sensitivity N(H_g) enables high-performance prediction with a small training data set.
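Both quantities in Definition III.4 can be evaluated numerically for small k. For a weight-one binary error e, H Λ e^T is a single column of H Λ, so m(H) is simply the largest Hamming weight among the columns of H Λ. The sketch below (our own illustration, not the authors' code) computes m(H_g) this way and estimates M(H_g) as the distance from a faithful diagnosis to the affine hull of the competing ones, which is how we read Eq. (63) once the affine constraint implicit in Eq. (62) is made explicit; the brute force over all class pairs is feasible only for small k.

```python
import numpy as np
from itertools import permutations, product

def sensitivity(H, Lam):
    """m(H) of Eq. (57): largest Hamming weight among the columns of H Lambda (mod 2)."""
    HL = (H @ Lam) % 2
    return int(HL.sum(axis=0).max())

def point_to_affine_hull(x, points):
    """Euclidean distance from x to the affine hull of the given points."""
    p0, rest = points[0], points[1:]
    if not rest:
        return float(np.linalg.norm(x - p0))
    A = np.stack([p - p0 for p in rest], axis=1)
    y, *_ = np.linalg.lstsq(A, x - p0, rcond=None)
    return float(np.linalg.norm(x - p0 - A @ y))

def boundary_distance(H_g, G, Lam, k):
    """M(H_g) of Eq. (63), by brute force over ordered pairs of classes (small k only)."""
    ws = list(product([0, 1], repeat=2 * k))
    g = {w: ((H_g @ Lam @ ((np.array(w) @ G) % 2)) % 2).astype(float) for w in ws}
    best = np.inf
    for w, wp in permutations(ws, 2):
        midpoint = (g[w] + g[wp]) / 2.0
        others = [g[wpp] for wpp in ws if wpp not in (w, wp)]
        best = min(best, point_to_affine_hull(g[w], [midpoint] + others))
    return best

# Normalized sensitivity of Definition III.4:
# N = sensitivity(H_g, Lam) / boundary_distance(H_g, G, Lam, k)
```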
5. Uniform data construction
We propose specific constructions which minimize thenormalized sensitivity up to the order of d in the case of k = 1. We first consider a lower-bound of the normal-ized sensitivity. When a diagnosis matrix H g is faithful,each row vector of H g corresponds to a logical operatoror a stabilizer operator. We denote the number of thelogical operators in the rows of H g as n L . The minimumboundary distance M ( H g ) is upper-bounded by M ( H g ) ≤ n L d of one-elementsin its binary representation, there are at least dn L ofone-elements in the diagnosis matrix. By denoting thenumber of the one-elements in the diagnosis matrix H g as χ ( H g ), we have dn L ≤ χ ( H g ) . (66)Since there are 2 n columns in H g , we also have χ ( H g ) ≤ n max i h (( H T g ) i ) . (67)The sensitivity m ( H g ) is equal to the maximum hammingweight of the column vectors of the diagnosis matrix,namely, max i h (( H g ) T i ) = m ( H g ) . (68)From Eqs. (65) - (68), we obtain N ( H g ) ≥ dn (69)1In particular, when we focus on the two-dimensionaltopological codes such that n = Θ( d ) , the order ofthe normalized sensitivity is lower-bounded as N ( H g ) = Ω( d − ) . (70)For surface codes and color codes with the single logicalqubit, we found specific constructions of H g such that N ( H g ) scales as Θ( d − ). See appendix C for the specificconstructions. We named these constructions as uniformdata construction of the data set, since logical operatorscorresponding to the rows of H g are chosen uniformly tocover all the physical qubits. B. Construction of data set and example
Let us summarize the discussion in Sec III A. Given thecheck matrix H c of the code and the error model { p e } ,the whole protocol can be described as follows. • Preparation:
We construct a faithful and decomposable diagnosis matrix H_g with a small normalized sensitivity, possibly N(H_g) = Θ(d/n). We choose a pure error t(s) and a generator matrix G. We perform a QR decomposition of the matrix
$D := \begin{pmatrix} g(0\cdots0) & \cdots & g(1\cdots1) \\ 1 & \cdots & 1 \end{pmatrix}$,  (71)
where $g(w) = H_g \Lambda (wG)^T$, and obtain Q and R. We calculate the left inverse matrix D⁻ as
$D^- := R^{-1} Q^T$.  (72)

• Data generation: We generate a set of physical errors {e_1, e_2, ...} with the probability distribution {p_e}, and generate the data set {(s_1, g_1), (s_2, g_2), ...} from it, where $s_i := H_c \Lambda e_i^T$ and $g_i := H_g \Lambda e_i^T$.

• Training:
The model is trained so that it canpredict g from s . The loss of the prediction is de-fined as the L2 distance between g and g P , where g P is a real-valued output vector of the model. • Prediction:
When an observed syndrome s is given to the trained model, it predicts g^P(s). In parallel, we calculate δ(s) given by
$\delta(s) = H_g \Lambda\, t(s)^T$.  (73)
We calculate the vector q_s defined in Eq. (34) as
$q^P_s = D^- \begin{pmatrix} \sigma_{\delta(s)}(g^P(s)) \\ 1 \end{pmatrix}$,  (74)
where $\sigma_{\delta(s)}$ is the affine transformation such that
$(\sigma_{\delta(s)}(v))_i = \delta_i + (-1)^{\delta_i} v_i$.  (75)
We choose w^P that satisfies
$q^P_s(w^P) = \max_{w \in \{0,1\}^{2k}} q^P_s(w)$,  (76)
where {q^P_s(w)} are the elements of q^P_s. Then, we obtain an estimated recovery operator
$r(s) = w^P G \oplus t(s)$.  (77)

We emphasize that the choice of t(s) and G does not affect the performance of the decoder, since the success of the estimation is independent of them. Only the construction of H_g affects the performance of the decoder.

We show a specific example of the decoding scheme. For simplicity, we consider the case where there are only bit-flip errors in the [[2d² − 2d + 1, 1, d]] surface code. In this case, it is enough for QEC to consider the stabilizer operators with Pauli Z operators. A simplified picture of the code is shown in Fig. 4. In this picture, a bit-flip error on a physical qubit is represented by the color of the corresponding edge (green: no error, red: error), and the syndrome value is represented by the color of the circle (green: undetected, red: detected). As shown in Fig. 4(a), the matrix H_g is constructed from logical operators, each of which is the product of the Pauli Z operators on the edges crossing one of the dotted lines. In this case, m(H_g) = O(1) while M(H_g) grows with the distance, so the normalized sensitivity m(H_g)/M(H_g) decreases polynomially in d.

Suppose that bit-flip errors occur on a set of the physical qubits as shown in Fig. 4(b). The physical error is detected through the syndrome values shown in the same figure. The diagnosis vector is calculated as the commutation relation between the chosen logical operators and the physical error. We show the calculated diagnosis on the right side of the lattice. In the training phase, the model learns the relation between the positions of the red circles and the values of the diagnosis vector. In the prediction phase, only the positions of the red circles are given. The trained neural network outputs a real-valued prediction of the diagnosis vector as shown in Fig. 4(c), for example. From this information, we extract the probabilities of the faithful diagnoses, and we choose the faithful diagnosis which is expected to be the most probable, as shown in Fig. 4(d). Since the chosen diagnosis vector is equivalent to the diagnosis vector generated by the actual physical error, this decoding trial is a success.
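Putting the preparation and prediction steps together, the following sketch (our own illustration; names such as pure_error and decode are ours, and pure_error stands for any routine returning a t(s) with H_c Λ t(s)^T = s) turns a real-valued network output g^P(s) into a recovery operator via Eqs. (71)–(77). The columns of D are assumed to be ordered in the same way as itertools.product enumerates the classes w.

```python
import numpy as np
from itertools import product

def left_inverse(D):
    """D^- = R^{-1} Q^T from a reduced QR decomposition, as in Eqs. (71)-(72)."""
    Q, R = np.linalg.qr(D)            # D is (L_g + 1) x 2^(2k) with full column rank
    return np.linalg.solve(R, Q.T)

def decode(g_pred, s, H_g, G, Lam, D_inv, pure_error, k=1):
    """Recovery operator of Eqs. (73)-(77) from a real-valued predicted diagnosis."""
    t = pure_error(s)                               # any t(s) with H_c Lambda t^T = s
    delta = (H_g @ Lam @ t) % 2                     # Eq. (73)
    sigma = delta + ((-1.0) ** delta) * g_pred      # Eq. (75), applied elementwise
    q = D_inv @ np.append(sigma, 1.0)               # Eq. (74)
    classes = list(product([0, 1], repeat=2 * k))
    w_best = np.array(classes[int(np.argmax(q))])   # Eq. (76)
    return (w_best @ G + t) % 2                     # Eq. (77)
```

C. Relation to the existing methods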
In this subsection, we explain how the existing meth-ods [24–27] can be treated in the linear prediction frame-work. The method proposed by Varsamopoulos et al. [25] used an approach similar to the example shown inSec. III A 2 in the case of k = 1. In this method, a linearmap is used for the pure error, which is called a sim-ple decoder. The pure error is then written in the form2 (a) (b) (c) (d) FIG. 4: The figures show the decoding process based on proposed scheme. Each picture shows only Z lattice, ofwhich the edge corresponds to whether there is a bit-flip error on the physical qubit or not, and the circle showswhether an error is detected through the syndrome measurement. (a) Five logical Z operators which minimize thenormalized sensitivity m ( H g ) M ( H g ) . (b) The actual physical error is drawn as red edges, and the detected syndromes asred circles. The binary numbers shown to the right is the diagnosis vector of the physical errors. The neural networklearns the relation between the location of the detected syndromes and the diagnosis vector. (c) The real-valueddiagnosis vector is predicted by the neural decoder. (d) With the syndrome pattern, faithful diagnosis vector iseither 10000 or 01111. The chosen faithful diagnosis vector is 10000. Accordingly, we choose the recovery operatorshown in the figure. In this case, the decoding succeeds. t ( s ) T = T s , where T is a 2 n × ( n − k ) matrix satisfying H c Λ T = I . The label vector used in this method canessentially be regarded as being generated by a diagnosismatrix defined by H g = l l l ( I ⊕ Λ T H c ) . (78)We see this is faithful and decomposable constructions.Let a generator matrix G be G = (cid:18) l l (cid:19) . (79)Then, a diagnosis generated from the diagnosis matrix is g = H g Λ e = w ( e ) w ( e ) w ( e ) ⊕ w ( e ) , (80)where w ( e ) = ( w ( e ) , w ( e ) ). The method in Ref. [25]uses a different set of label vectors g (cid:48) called one-hot rep-resentation, which has a one-to-one correspondence with g as g = (0 , , T (cid:55)→ g (cid:48) = (1 , , , T (81) g = (0 , , T (cid:55)→ g (cid:48) = (0 , , , T (82) g = (1 , , T (cid:55)→ g (cid:48) = (0 , , , T (83) g = (1 , , T (cid:55)→ g (cid:48) = (0 , , , T . (84)The above relation as real vectors can be written as g (cid:48) = 12 − − − − − − (cid:18) g (cid:19) + . (85)Since it is an isometric affine transformation, we expectthat this transformation has little effect on the perfor-mance of the supervised machine learning. The matrix H g is faithful and decomposable, but its normalized sensi-tivity is O (1). We thus expect that this decoder becomesnear-optimal when the training is ideally performed, butthe prediction is not robust when the size of the trainingdata set is small.The method proposed by Baireuther et al. [26] mainlyfocuses on a model applicable to quantum error correc-tion when we perform various counts of repetitive stabi-lizer measurements by utilizing recurrent neural network.They use the commutation relation between the physicalerror and a logical Z operator as the label, since theyonly concerned about the logical bit-flip probability withthe fixed initial state in the logical space. We can thusconsider this method as a case of the linear predictionframework.Torlai et al. [24], Krastanov et al. [27], and Breuck-mann et al. [28] took a different approach from the abovetwo [25, 26]. They used the binary representation of thephysical error as the label vector. 
In the linear predictionframework, it corresponds to a choice of H g = Λ leadingto g = H g Λ e T = e T (86)Since H g is not faithful, it cannot constitute an optimaldecoder even with the delta diagnosis decoder. Interest-ingly, the delta diagnosis decoder with this choice of H g works as an MD decoder, which can be shown by thefollowing lemma. Lemma III.3.
If the matrix H cg has rank 2 n in GF(2),there exists a map r ∗ ( g , s ) such that the decoder with r ( s ) = r ∗ ( g ( δ ) ( s ) , s ) works as an MD decoder for arbi-trary distribution { p e } . If H cg does not have rank 2 n ,no such map exists. Proof. If H cg has rank 2 n , there exists a left inverse bi-nary matrix H − cg such that H − cg H cg = I . Then, we can3obtain the physical error e asΛ H − cg (cid:18) sg (cid:19) = e T . (87)Thus, we can obtain the most probable physical error e ∗ ( s ) from the most probable diagnosis.If H cg does not have rank 2 n in GF(2), there exist twophysical errors which generate the same pair of syndromeand diagnosis. We cannot determine which is more prob-able. Thus, we cannot perform MD decoding when H cg does not have rank 2 n .A drawback in this approach is difficulty arising whenwe replace a loss function with a practical one such asL2 distance. In order to satisfy a decomposable propertyin MD decoding, the length of the diagnosis must be noshorter than 2 n + k since there are 2 n + k possible candi-dates of the most probable physical error. This is notpractical when the distance is large, and thus it requiresheuristics such as repetitive sampling. D. Numerical result
We numerically show that the uniform data construction improves the performance of the neural decoder in the case of k = 1. We trained an MLP model with the uniform data construction, and compare it with other data constructions of the neural decoder. We also make a comparison with known decoders such as the MD decoder and the MWPM decoder. We choose the [[d², 1, d]] surface code for the comparison, since most of the existing methods were benchmarked with this code. We calculated the performance for two types of error models, the bit-flip noise and the depolarizing noise. The probability distribution of the bit-flip noise is described as follows:
$p_e = \begin{cases} p^{w(e)} (1-p)^{n - w(e)} & \text{if } e_i = 0\ \ \forall i > n, \\ 0 & \text{otherwise}, \end{cases}$  (88)
where p is an error probability per physical qubit, and w(e) is the weight of the physical error e defined in Eq. (4). The probability distribution of the depolarizing noise is described as follows:
$p_e = (p/3)^{w(e)} (1-p)^{n - w(e)}$.  (89)
Note that the occurrences of bit-flip and phase-flip errors are correlated in the depolarizing noise. We first calculated the performance when the physical error probability is around the error threshold, namely, p = 0.1 for the bit-flip noise and p = 0.
15 for the depolarizingnoise. The tunable hyper-parameters of the neural net-work, such as number of layers in network, number ofneurons in each layer, and learning rate, are optimizedwith a grid search for each noise model and for each sizeof the training data set. See Appendix B for the detailsof the parameter optimization and implementation. The performance of the neural decoder under the bit-flip noise is shown in Fig. 5(a). The solid lines are theperformance of the neural decoder with the uniform dataconstruction. The bottom dashed lines represent the log-ical error probability achievable with the MD decoder.The colors red, green, blue, and cyan correspond to dis-tances 5, 7, 9, 11, respectively.Comparing these two types of decoders, we see thatthe logical error probability of the neural decoder is near-optimal with 10 data set at distance 11. On the otherhand, there are gaps between the converged logical errorprobabilities of the neural decoder and that of the MDdecoder when the distance is large. We speculate thatthese gaps are caused by imperfect learning of the spatialinformation of the topological codes, since it is partiallyimproved with the network construction discussed in thenext section.We also implemented the neural decoder with short di-agnosis, i.e. the construction with N = N = N = 1,where N w is a number of logical operators in the rowsof H g corresponding to the class w . This is equivalentto the construction which we showed as an example inSec. III A 2. We call this construction, with the normal-ized sensitivity of O (1), as short diagnosis construction,which is shown as the pale plots in Fig. 5(a). Note thatthe performance of this decoder depends on the choiceof the logical operators. We have tried this constructionwith various choice of the logical operators. The plot-ted data is the best among our trials. Although bothconstructions become near-optimal in the limit of largetraining data size, we see that the performance with theuniform data construction achieves smaller logical errorprobability than that with the short diagnosis construc-tion for any size of the training data set. We have alsoconfirmed that the performance of the neural decoder de-grades when the row vectors of H g consist of the same O ( d ) logical operators of X , Y and Z . In this case,while the number of the rows in H g is the same as thatof the uniform data construction, the sensitivity m ( H g )becomes O ( d ), which makes the normalized sensitivity m ( H g ) M ( H g ) to be O (1). Though these results are not plot-ted, the performance of this construction is almost thesame as the short diagnosis construction. These resultssupport our argument that it is essential for the perfor-mance of the neural decoder to minimize the normalizedsensitivity.The results with the depolarizing noise are shown inFig. 5(b). Note that for the surface code under corre-lated noise such as the depolarizing noise, it is not knownhow an efficient MD decoder can be constructed. We seethat the performance of the neural decoder becomes near-optimal, and is superior to that of the MWPM decoderwith 10 training samples at d = 5 , , , and calculatedthe performance for the distance d = 5 , , ,
The results with the depolarizing noise are shown in Fig. 5(b). Note that for the surface code under correlated noise such as the depolarizing noise, it is not known how an efficient MD decoder can be constructed. We see that the performance of the neural decoder becomes near-optimal, and is superior to that of the MWPM decoder with 10 training samples at d = 5, 7, 9, and 11.

FIG. 5: The performance comparison between the neural decoder with the uniform construction (solid lines) and that with the short diagnosis construction (pale lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the [[d^2, 1, d]] surface code. The logical error probabilities are plotted against the sizes of the training data set with the fixed physical error probability p. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case for the bit-flip noise with p = 0.1. Note that there are no lines for the MWPM decoder, since the MWPM decoder is equivalent to the MD decoder in this setting. (b) The case for the depolarizing noise with p = 0.15.

The performance of the neural decoder for a fixed size of the training data set is shown in Fig. 6. For both of the noise models, the performance is near-optimal when the distance is small. On the other hand, when the distance becomes large, the logical error probability becomes larger than that of the MWPM decoder. The error threshold is usually estimated from the cross point of the performance curves for different distances. We see that the error threshold estimated in this way is worse than that of the MWPM decoder, though the logical error probability is smaller than that of the MWPM decoder.

The actual experiment is expected to be performed with a physical error probability sufficiently smaller than the threshold value. Therefore, we calculated the performance of the decoder at a small physical error probability. The numerical results are shown in Fig. 7. Since the training data set generated with a small value of p is highly imbalanced, we trained the model with p = 0.08 for the bit-flip noise model and with p = 0.11 for the depolarizing noise model. Then, we tested the trained model with data sets generated with p ≤ 0.1. We see that the logical error probability is smaller than that of the MWPM decoder in this region, for all the distances except d = 11.

We also calculated the performance of the neural decoder for two types of color codes. We chose the size of the training data set as 10, and calculated the logical error probability for the distances d = 3, 5, 7, and 9. Note that we cannot construct an efficient MD decoder for the color codes even under independent bit-flip and phase-flip noise. The plots of the logical error probability against the physical error probability p are shown in Fig. 8. The configurations of the plots and lines are the same as those for the surface code. In the case of the bit-flip noise, near-optimal performance is achieved. The performance is also near-optimal in the case of the depolarizing noise at all distances except d = 9. We also see that the performance of the [4,8,8]-color code is better than that of the [6,6,6]-color code. We speculate that this is because the number of physical qubits required in the [4,8,8]-color code is smaller than that of the [6,6,6]-color code at the same distance. These results suggest that the neural decoder with the uniform data construction is effective also for the color codes.

IV. UTILIZING SPATIAL INFORMATION
In this section, we describe the construction of the neural network with convolutional layers. We first discuss how the required size of the data set is expected to be suppressed if the model can utilize the spatial information of the two-dimensional quantum codes. Then, we introduce a construction of the neural network with convolutional layers that utilizes the spatial information of the topological codes. We finally present numerical results showing that the performance of the neural decoder is improved.
A. Importance of the spatial information
In this section, we utilize spatial information of the syndrome by using a convolutional neural network (CNN) as a prediction model. When we use the MLP model, each syndrome value is fed to the network as an independent element of a one-dimensional feature vector, so the model is not informed of the spatial arrangement of the syndrome values in the topological codes.

FIG. 6: The performance comparison between the neural decoder with the uniform construction (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the [[d^2, 1, d]] surface code. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan) with the same 10 training data set. (a) The case of the bit-flip noise. (b) The case of the depolarizing noise.

FIG. 7: The performance comparison between the neural decoder with the uniform construction (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the [[d^2, 1, d]] surface code. The neural decoder is trained with the 10 training data set. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case of the bit-flip noise. The training data set is generated at the physical error probability p = 0.08. (b) The case of the depolarizing noise. The training data set is generated at the physical error probability p = 0.11.

FIG. 8: The performance comparison between the neural decoder with the uniform construction (solid lines) and the MD decoder (dashed lines) in the color codes. We calculated the performance for distances d = 3 (black), 5 (red), 7 (green), and 9 (blue) with the 10 training data set. (a) The case of the bit-flip noise in the [4,8,8]-color code. (b) The case of the depolarizing noise in the [4,8,8]-color code. (c) The case of the bit-flip noise in the [6,6,6]-color code. (d) The case of the depolarizing noise in the [6,6,6]-color code.

In the case of the two-dimensional topological codes, the syndrome values have a natural two-dimensional arrangement. By carefully reshaping the syndrome values into a matrix-shaped arrangement of the feature vector elements, we can explicitly let the model use local correlations of the observed syndromes with the CNN model. In the topological codes, a flip of a single physical qubit invokes at most a constant number (2 in the surface code, 3 in the color code) of local bit-flips in the syndrome value. This implies that whether two (or three) flipped syndrome bits are found in a local region or not is useful for predicting the property of the physical errors. For an intuitive understanding, we elaborate the reason through examples. We consider the surface code under bit-flip errors. Suppose that a syndrome vector s is given in the prediction phase, and the model has encountered slightly different syndrome vectors s_A and s_B, whose differences from s are shown in Fig. 9, in the training phase. The representation is the same as that of Fig. 4. We ignore boundary effects of the topological codes for simplicity. In both cases, the syndrome is at Hamming distance two from the original syndrome vector s, namely, h(s ⊕ s_A) = h(s ⊕ s_B) = 2. On the other hand, there is a difference between the two syndromes in light of whether they help the prediction of the diagnosis for the observed syndrome s. In Eq. (59), we introduced a set of physical errors S(s, w*; h) such that any e ∈ S(s, w*; h) with h ≲ N(H_g) produces a training sample useful for estimating the L2 diagnosis vector for s. For a given s and s′, if there is no vector e_δ such that H_c Λ e_δ^T = s ⊕ s′ and h(e_δ) ≲ N(H_g), we see that no errors e with s(e) = s′ are contained in ∪_{w ∈ {0,1}^{2k}} S(s, w; N(H_g)). In the case of s_A and s_B, there is such a physical error e_δ with a small Hamming weight for s_A, but not for s_B. Thus, if the prediction model can distinguish the samples with s_A from those with s_B, it can recognize that the samples with s_B in the training data set are not relevant to the prediction for s.

FIG. 9: Example of the difference of the syndrome values. Each node in the figure corresponds to a syndrome value, and each edge corresponds to the error status of a physical qubit. The color of the circle indicates whether the syndrome measurement detects an error.

The CNN model can distinguish them since it naturally utilizes the spatial information of the syndrome values.
On the other hand, the MLP model cannot easily distinguish them, since the model is not provided with the relevant spatial structure before training. This discussion implies that, for a fixed size of the training data set, the logical error probability is expected to be improved by the use of a CNN model.

B. Construction of the network
A convolutional neural network extracts patterns from image data through trainable filters that activate (produce a high value) when specific local patterns are present in the input data. The network usually consists of multiple convolutional layers C^(n), each of which consists of differently filtered versions of the image data C^(n)_p, indexed by a channel number p. The (n−1)-th layer with Q channels is filtered to the n-th layer with P channels by Q × P filters, which we represent by f^(n−1,n). We can describe this relation as follows:

    C^{(n)}_{i,j,p} = A( ∑_{d_x} ∑_{d_y} ∑_q f^{(n−1,n)}_{d_x,d_y,q,p} C^{(n−1)}_{i+d_x, j+d_y, q} + b^{(n)}_p ),    (90)

where C^{(n)}_{i,j,p} is the (i, j) element of the p-th channel in the n-th convolutional layer, and f^{(n−1,n)}_{d_x,d_y,q,p} is the (d_x, d_y) element in the (q, p)-th filter from the (n−1)-th layer to the n-th layer. The parameter b^{(n)}_p is the bias added to the p-th channel of the n-th layer. A simple example is shown in Fig. 10, where one layer has three channels and the next layer has two channels.
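As an illustration, the following is a minimal numpy sketch of Eq. (90); it is not the authors' code. The activation A is taken to be ReLU and "valid" boundary handling is assumed for simplicity.

import numpy as np

def conv_layer(C_prev, f, b):
    """One layer of Eq. (90). C_prev: (H, W, Q); f: (Dx, Dy, Q, P); b: (P,)."""
    H, W, Q = C_prev.shape
    Dx, Dy, _, P = f.shape
    out = np.zeros((H - Dx + 1, W - Dy + 1, P))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = C_prev[i:i + Dx, j:j + Dy, :]        # C^(n-1)_{i+dx, j+dy, q}
            for p in range(P):
                out[i, j, p] = np.sum(f[:, :, :, p] * patch) + b[p]
    return np.maximum(out, 0.0)                          # activation A (ReLU here)

# Toy usage: a d x (d-1) syndrome matrix for d = 5 through one 3x3 layer with 10 channels.
d = 5
syndrome = np.random.randint(0, 2, size=(d, d - 1, 1)).astype(float)
print(conv_layer(syndrome, np.random.randn(3, 3, 1, 10), np.zeros(10)).shape)  # (3, 2, 10)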
To use a CNN in our decoding task, we have to express the syndrome vector s in an appropriate matrix representation. We reallocate the syndrome vector for the [[2d^2 − 2d + 1, 1, d]] and [[d^2, 1, d]] codes as shown in Fig. 11. For the [[2d^2 − 2d + 1, 1, d]] surface code, s is converted into two d × (d − 1) matrices for the X syndrome and the Z syndrome. Similarly, for the [[d^2, 1, d]] surface code, s is converted into two (d − 1) × (d + 1)/2 matrices. The overall architecture of the network is shown in Fig. 12; the channel number is chosen to be 10d for the first two convolutional layers and 5d for the last layer. Details about the model architecture are described in Appendix B. It is worth noting that we used the same filters for decoding both X- and Z-flip errors, and max-pooling is not used, as it was observed to reduce the performance of the decoder.

C. Numerical result
We call a neural decoder with the MLP model an MLP decoder, and one with the CNN model a CNN decoder. We compare the performance of the CNN decoder with those of the MLP decoder, the MD decoder, and the MWPM decoder. Note that the training data set is generated with the uniform data construction.

First, we compare the performance of the CNN decoder and that of the MLP decoder in the case of the surface codes. The numerical results are shown in Fig. 13. In this figure, the solid lines and dashed lines are the logical error probabilities for the CNN decoder and the MLP decoder, respectively. The colors red, green, blue, and cyan correspond to distances d = 5, 7, 9, and 11, respectively. For both types of the surface codes, the CNN decoder shows superior performance to that of the MLP decoder at large distances. In particular, in the case of the [[2d^2 − 2d + 1, 1, d]] surface code, the CNN decoder shows a significant improvement of the logical error probability. We see that the CNN model is effective for improving the performance of the neural decoder at large distances. On the other hand, we see that the CNN decoder shows inferior performance to the MLP decoder at a small distance. We speculate the reason for this as follows. The CNN model assumes that the local features can be extracted by using the same filter everywhere. Such an assumption is not necessarily true when the distance is small, since almost all the filtered local regions, of size 3 × 3, are affected by the boundaries of the code. We also tried other choices, but the performance at the small distance did not improve.

FIG. 10: A simple case of a convolutional layer where the input channel number is three and the output channel number is two.

Next, we compared the performance of the CNN decoder with those of the MD decoder and the MWPM decoder. The results are shown in Fig. 14. The solid lines, the dashed lines, and the dotted lines are the logical error probabilities for the CNN decoder, the MD decoder, and the MWPM decoder, respectively. The colors red, green, blue, and cyan correspond to distances d = 5, 7, 9, and 11, respectively. In the case of the bit-flip noise, we see that the logical error probability of the CNN decoder is equal to or slightly better than that of the MD decoder. In the case of the depolarizing noise, though there are gaps between the performances of the CNN decoder and the MD decoder, the performance of the CNN decoder is superior or comparable to that of the MWPM decoder even at the distance d = 11.

We also calculated the logical error probability of the CNN decoder at a small physical error probability p in the case of the [[2d^2 − 2d + 1, 1, d]] surface code. We trained the CNN decoder at p = 0.08 for the bit-flip noise model, and at p = 0.11 for the depolarizing noise model. Then, the decoder is tested with data sets generated with small physical error probabilities. The plots are shown in Fig. 15. In the case of the bit-flip noise, the CNN decoder achieves performance close to the MD decoder also at small physical error probabilities. In the case of the depolarizing noise, the performance of the CNN decoder is superior to that of the MWPM decoder at d = 9, and comparable at d = 11. We can say that the CNN model is effective also for the use of neural decoders at small physical error probabilities.

V. CONCLUSION
In this paper, we theoretically analyzed the mechanism of machine-learning-based decoders for QEC, and proposed a general direction for constructing the data set and the neural network. Then, we numerically showed that our direction is effective compared with the existing works.

Since the formalism of machine learning is flexible, there are many possible ways to reduce the decoding problem in QEC to a task of machine learning. In order to clarify what is the best way of reduction, we introduced the linear prediction framework. This framework essentially includes the existing methods as specific cases, and enables us to discuss conditions for satisfying natural requirements for a good decoder for QEC. In particular, we have derived the condition to perform the optimal decoding in the limit of a large training data size. We also introduced a measure, the normalized sensitivity, which represents a properly-scaled bound on the deviation in the prediction target resulting from a small change in the physical error pattern. We proposed to use this measure as a criterion for constructing a better decoder. We then proposed a general direction for constructing the data set, the uniform data construction, which is applicable to general topological codes. We numerically confirmed that the performance of the neural decoder is improved with the uniform data construction. Our decoder was found to be superior to known efficient decoders, such as the neural decoders proposed in the existing methods and the decoder based on the reduction to minimum-weight perfect matching. We also confirmed that the performance of our neural decoder is near-optimal in various situations by comparing it with the minimum-distance decoder, which is known to be near-optimal but not efficient in general. We also confirmed that the neural decoder can achieve near-optimal performance not only for surface codes but also for color codes.

Another important factor of the neural decoder is the construction of the neural network. We discussed the importance of the spatial information of the syndrome measurement in order to let the prediction model recognize useful samples in a given training data set. To utilize the spatial information, we proposed a neural decoder with a convolutional neural network. We numerically observed that the performance of the neural decoder is further improved with this network construction in the surface code. In particular, we showed that the proposed neural decoder achieves a smaller logical error probability than that of the decoder based on minimum-weight perfect matching even at distance d = 11 with a training data set size of 10.

FIG. 11: How the syndrome vectors are split and reallocated to the two input layers of the neural network. In the case of the [[2d^2 − 2d + 1, 1, d]]-code, the lattice is split into a (d − 1) × d array of syndromes, which is rotated by π/4 to form a d × (d − 1) matrix as the first layer of the neural network. In the case of the [[d^2, 1, d]]-code, we split the syndromes into two (d − 1) × (d + 1)/2 arrays.

FIG. 12: CNN decoder architecture used in our work. We separately pass the X and Z syndrome values through the same convolutional layers, and concatenate them before feeding them to the following fully-connected hidden layer.

Since using machine learning for QEC is an emergent field, there are still many possible extensions and directions for the neural decoders. As we detailed in Appendix B, the prediction time of the neural decoders is smaller than that of the MD decoder, but larger than that of the MWPM decoder on our desktop PC. Since the prediction of the neural decoders can be done with simple matrix multiplications, the time for prediction can be further shortened by using optimized hardware such as a field-programmable gate array (FPGA), which is popularly used in experiments. While we have discussed only a label linearly generated in GF(2), the performance may be further improved by allowing labels nonlinearly generated from the physical error. For example, the relation between the syndrome values and the weight of the physical error, which cannot be generated linearly in GF(2), can be trained and predicted independently with a neural network. Then, the recovery map can be predicted from the syndrome values and the predicted weight with another neural network. The linear prediction framework also limits the samples in the training data set to those sampled from the assumed physical error distribution. However, the distribution which is the best for training is not necessarily the same as the actual distribution. For example, we saw that the prediction model trained at a physical error probability around the threshold value shows high performance also at low physical error probabilities. There can be a more artificial way to construct the training data set that achieves the same performance with a smaller size of the training data set. In the numerical investigation, we observed that the required amount of the data set becomes exponentially large in terms of the distance. This may be suppressed by renormalizing the matrix representation of the syndrome with trained filters, as done in the renormalization group decoder [19]. We expect that the CNN is also applicable to the color codes by using non-rectangular filters. When the stabilizer measurements themselves suffer from noise, stabilizer measurements are often performed repetitively during QEC. In such a case, the length of the syndrome data is not fixed. In our construction, we need to train the neural network again whenever the length of the syndrome data changes.

FIG. 13: The performance comparison between the CNN decoder (solid lines) and the MLP decoder (dashed lines) in the case of the surface codes. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The bit-flip noise in the [[d^2, 1, d]] code. (b) The depolarizing noise in the [[d^2, 1, d]] code. (c) The bit-flip noise in the [[2d^2 − 2d + 1, 1, d]] code. (d) The depolarizing noise in the [[2d^2 − 2d + 1, 1, d]] code.
The studies of Refs. [26, 28] focused on removing this drawback by utilizing recurrent and convolutional neural networks. Using the techniques proposed in Refs. [26, 28], our neural decoder may also be applicable to the case where repetitive stabilizer measurements are performed.

ACKNOWLEDGEMENTS
This work is supported by KAKENHI Grant No. 16H02211; PRESTO, JST, Grant No. JPMJPR1668; CREST, JST, Grants No. JPMJCR1671 and No. JPMJCR1673; ERATO, JST, Grant No. JPMJER1601; and the Photon Frontier Network Program, MEXT. Y.S. is supported by the Advanced Leading Graduate Course for Photon Science. A.D. and Y.S. contributed equally to this work. Y.S. contributed to the construction of the data set in Sec. III, and A.D. to the construction of the network in Sec. IV. K.F. and M.K. motivated and supervised the idea and discussion of this paper.

FIG. 14: The performance comparison between the CNN decoder (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the surface codes. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The bit-flip noise in the [[d^2, 1, d]] code. (b) The depolarizing noise in the [[d^2, 1, d]] code. (c) The bit-flip noise in the [[2d^2 − 2d + 1, 1, d]] code. (d) The depolarizing noise in the [[2d^2 − 2d + 1, 1, d]] code.

APPENDIX A: PROOF OF THE LEMMAS

Proof of the converse part in Lemma III.1
Here we prove the last statement of Lemma III.1. When Eq. (19) does not hold, either (i) there exists e_1 such that

    e_1 ∉ L,    (91)
    H_{cg} Λ e_1^T = 0,    (92)

or (ii) there exists e_1 such that

    e_1 ∈ L,    (93)
    H_{cg} Λ e_1^T ≠ 0.    (94)

For (i), consider two probability distributions {p_e} and {p′_e} such that

    Pr_{e∼{p_e}}[e = 0 | s(e) = 0] = 0.75,    (95)
    Pr_{e∼{p_e}}[e = e_1 | s(e) = 0] = 0.25,    (96)

and

    Pr_{e∼{p′_e}}[e = 0 | s(e) = 0] = 0.25,    (97)
    Pr_{e∼{p′_e}}[e = e_1 | s(e) = 0] = 0.75.    (98)

An optimal decoder for each case succeeds with probability 0.75 given s = 0. On the other hand, since g^{(δ)}(0) = 0 in both cases, only the value of r*(0, 0) is relevant. Since w(0) ≠ w(e_1), any choice of r*(0, 0) leads to a success probability no greater than 0.25 for at least one of the cases.

FIG. 15: The performance comparison between the CNN decoder (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the [[2d^2 − 2d + 1, 1, d]] surface code, where the decoders are trained with the training data set generated at the fixed error rate. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case of the bit-flip noise. The training data set is generated at the physical error probability p = 0.08. (b) The case of the depolarizing noise. The training data set is generated at the physical error probability p = 0.11.

For (ii), choose w_1 ≠ 0, and if H_g Λ (w_1 G)^T ≠ 0, define

    e_2 := w_1 G;    (99)

otherwise, define

    e_2 := e_1 ⊕ w_1 G.    (100)

This ensures that s(e_2) = 0 and g_2 := H_g Λ e_2^T ≠ 0. Consider two probability distributions {p_e} and {p′_e} such that

    Pr_{e∼{p_e}}[e = 0 | s(e) = 0] = …,    (101)
    Pr_{e∼{p_e}}[e = e_1 | s(e) = 0] = …,    (102)
    Pr_{e∼{p_e}}[e = e_2 | s(e) = 0] = …,    (103)

and

    Pr_{e∼{p′_e}}[e = 0 | s(e) = 0] = …,    (104)
    Pr_{e∼{p′_e}}[e = e_1 | s(e) = 0] = …,    (105)
    Pr_{e∼{p′_e}}[e = e_2 | s(e) = 0] = …,    (106)

An optimal decoder for each case succeeds with probability … given s = 0. On the other hand, since g^{(δ)}(0) = g_2 in both cases, only the value of r*(g_2, 0) is relevant. Since w(0) ≠ w(e_2), any choice of r*(g_2, 0) leads to a success probability no greater than … for at least one of the cases.

Proof of the converse part in Lemma III.2
When the diagnosis matrix is not decomposable, there exists a non-empty subset
W ⊂ {0, 1}^{2k} such that

    ∑_{w∈W} α_w g(w) = ∑_{w∈{0,1}^{2k}\W} β_w g(w),    (107)

where α_w, β_w ≥ 0 and

    ∑_{w∈W} α_w = ∑_{w∈{0,1}^{2k}\W} β_w > 0.    (108)

Consider two probability distributions {p_e} and {p′_e} such that

    Pr_{e∼{p_e}}[w(e) = w, l(e) = l | s(e) = 0] = { α_w/Γ if w ∈ W and l = 0; 0 otherwise },    (109)
    Pr_{e∼{p′_e}}[w(e) = w, l(e) = l | s(e) = 0] = { β_w/Γ if w ∉ W and l = 0; 0 otherwise }.    (110)

From Eq. (107), the L2 diagnosis vector g^{(L2)}(0) is identical for the two distributions. On the other hand, the most probable class w is different for the two probability distributions. This means that a single decoder cannot perform the optimal decoding for both of the two distributions.

Proof of the existence of faithful and decomposable diagnosis matrices
In the main text, we showed that a diagnosis matrix should be faithful and decomposable for performing the optimal decoding in the ideal limit of the training process, and showed an example for the case k = 1. On the other hand, it is not trivial that there exists a faithful and decomposable construction of a diagnosis matrix for an arbitrary stabilizer code. We show that the diagnosis matrix H_g = WG, where W is a 2^{2k} × 2k binary matrix whose i-th row is the 2k-bit binary representation of the integer i, is always faithful and decomposable for an arbitrary stabilizer code and for an arbitrary number of logical qubits k. Since the row vectors of H_g contain all the logical operators, it is trivial that span({(H_{cg})_i}) is equivalent to the logical space L, and H_g is faithful. The condition for decomposability is equivalent to the condition that {g(w) | w ∈ {0,1}^{2k}}, where g(w) := H_g Λ (wG)^T, is affinely independent in the real vector space. To show the latter, we first prove that for any pair of binary vectors w, w′ ∈ {0,1}^{2k} such that w ≠ w′, the weight of g(w) ⊕ g(w′) is 2^{2k−1}. The 2^{2k}-bit sequence g(w) ⊕ g(w′) is given by

    g(w) ⊕ g(w′) = W G Λ G^T (w ⊕ w′)^T.    (111)

Since the matrix G Λ G^T is invertible and since w ⊕ w′ ≠ 0, we have G Λ G^T (w ⊕ w′)^T ≠ 0. Since the matrix W contains all the possible 2k-bit sequences as its rows, half of the elements in the sequence g(w) ⊕ g(w′) are 1, and the others are 0. Thus, the weight of g(w) ⊕ g(w′) is 2^{2k−1}.
Let v := (1, ..., 1)^T be a real vector of order 2^{2k}. We define a set of vectors h(w) := 2g(w) − v for w ∈ {0,1}^{2k}, where this calculation is done in the real vector space. Note that this map is equivalent to replacing 0 and 1 with −1 and 1, respectively. Since this map from g(w) to h(w) is affine, {g(w)} is affinely independent if {h(w)} is linearly independent. The inner product h(w) h(w′)^T for w ≠ w′ can be calculated as

    h(w) h(w′)^T = ∑_i h(w)_i h(w′)_i = 2^{2k} − 2 w(g(w) ⊕ g(w′)) = 0.    (112)

We used the fact that h(w)_i h(w′)_i is 1 if g(w)_i = g(w′)_i, and −1 otherwise. Since nonzero mutually orthogonal vectors are linearly independent, {h(w)} is linearly independent in the real vector space, and the set of vectors {g(w)} is affinely independent. This means that H_g = WG is faithful and decomposable for an arbitrary stabilizer code.
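The combinatorial facts used above can be checked numerically for small k. The following sketch (our illustration, not part of the original proof) verifies, for k = 1, that W u mod 2 has weight 2^{2k−1} for every nonzero u, and that the resulting vectors h(w) are mutually orthogonal; the matrix standing in for G Λ G^T is an assumed typical example.

import numpy as np
from itertools import product

k = 1
W = np.array(list(product([0, 1], repeat=2 * k)))   # all 2k-bit strings as rows

# Fact 1: for every nonzero 2k-bit vector u, W u mod 2 has weight 2^(2k-1).
for u in W[1:]:
    assert (W @ u % 2).sum() == 2 ** (2 * k - 1)

# Fact 2: with an invertible G Λ G^T (here a typical choice for k = 1 is assumed),
# the vectors h(w) = 2 g(w) - 1 with g(w) = W (G Λ G^T) w^T mod 2 are orthogonal.
M = np.array([[0, 1], [1, 0]])                      # assumed stand-in for G Λ G^T
h = {tuple(w): 2 * (W @ (M @ w) % 2) - 1 for w in W}
for w1 in h:
    for w2 in h:
        if w1 != w2:
            assert h[w1] @ h[w2] == 0
print("orthogonality of {h(w)} verified for k = 1")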
APPENDIX B: ADDITIONAL INFORMATION FOR THE IMPLEMENTATION OF THE DECODERS
We describe the details of the implementation of our models, the training process, and the decoders for reference.
Distance   Filter size             Channel number   Neuron number
5          [2x2], [3x3], [3x3]     50, 50, 25       1000
7          [2x2], [3x3], [4x4]     70, 70, 35       3000
9          [3x3], [4x4], [5x5]     90, 90, 45       5000
11         [4x4], [5x5], [6x6]     110, 110, 55     7000
TABLE I: Network architecture of the [[2d^2 − 2d + 1, 1, d]] surface code.

We chose rectified linear units (ReLU(x) = max(0, x)) and a sigmoid function (S(x) = 1/(1 + e^{−x})) as the activation function for the hidden layers and for the final output layer, respectively. Batch normalization was deployed in all of our models and was found to be effective. We also used L2 regularization to avoid over-fitting of the model. In the training phase, the Adam optimization method [33] was used. The learning rate was exponentially decreased, and its schedule was optimized by hand. The network was built with the tensorflow v1.2 platform.

Details about the multilayer perceptron
We optimized the following parameters of the multilayer perceptron using a grid search: the number of neurons per layer, the regularization coefficient β, the batch size, and the number of layers, each over a small set of candidate values (the range for the number of neurons scales with the distance d). Note that in the case of d = 11, we restricted this range due to the memory limit of the GPU. We started the training with a learning rate of 10^{−…}, and it was decreased to 10^{−…} according to a schedule which was optimized by hand. We optimized these parameters for each construction of the diagnosis matrix, distance, physical error probability, error model, and size of the training data set. We chose the configuration which achieves the smallest logical error probability on an independently generated validation data set of size 10. Then, the logical error probability is calculated using another test data set of size 10.

Details about the Convolutional Neural Network
Our CNN model consists of three convolutional layers on top of a single fully-connected hidden layer. For each convolutional layer, the channel number was chosen to be 10d for the first two layers and 5d for the last layer. We chose the batch size as 100 in the training of the CNN model. The network architecture was the same for both the bit-flip and the depolarizing noise models in the [[2d^2 − 2d + 1, 1, d]] surface code, and is described in TABLE I. As for the [[d^2, 1, d]] code with the bit-flip and depolarizing noise models, we used the network architecture described in TABLE II. The filter stride was set to 1 in all directions.

Distance   Filter size             Channel number   Neuron number
5          [2x2], [3x3], [3x3]     50, 50, 25       1000
7          [2x2], [3x3], [3x4]     70, 70, 35       3000
9          [2x3], [3x4], [4x5]     90, 90, 45       5000
11         [2x4], [3x5], [4x6]     110, 110, 55     7000

TABLE II: Network architecture of the [[d^2, 1, d]] surface code.
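For reference, the following is a minimal sketch of a network with the TABLE I architecture for d = 5, written with the present tf.keras API rather than the tensorflow v1.2 code used in this work. The "same" padding, the sigmoid output of length 3d (the number of rows of H_g in the uniform construction), and the loss function are assumptions not fixed by the table; batch normalization and L2 regularization are omitted for brevity.

import tensorflow as tf
from tensorflow.keras import layers, Model

d = 5                      # code distance
in_shape = (d, d - 1, 1)   # each syndrome type reshaped into a d x (d-1) matrix
n_out = 3 * d              # assumed length of the diagnosis vector (uniform construction)

# Shared convolutional layers: the same filters process the X and Z syndromes.
convs = [
    layers.Conv2D(50, (2, 2), padding="same", activation="relu"),
    layers.Conv2D(50, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(25, (3, 3), padding="same", activation="relu"),
]

def branch(x):
    for c in convs:        # applying the same layer objects gives shared weights
        x = c(x)
    return layers.Flatten()(x)

x_in = layers.Input(shape=in_shape, name="x_syndrome")
z_in = layers.Input(shape=in_shape, name="z_syndrome")
h = layers.Concatenate()([branch(x_in), branch(z_in)])
h = layers.Dense(1000, activation="relu")(h)          # fully-connected hidden layer
out = layers.Dense(n_out, activation="sigmoid")(h)    # predicted diagnosis vector

model = Model([x_in, z_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()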
Implementation of the minimum-distance decoder
The minimum-distance decoder of the surface code under the bit-flip noise can be implemented by reducing the problem to minimum-weight perfect matching. The minimum-weight perfect matching can be efficiently solved with the blossom algorithm [31]. We used Kolmogorov's implementation of the blossom algorithm [34]. In the other cases, we reduced the problem to the following instance of integer programming:

    minimize w(e)  subject to  H_c Λ e^T = s.    (113)

This problem was solved with IBM ILOG CPLEX. We obtained at least 10 samples for each plot. In all the cases, the solver reached the optimal solution.
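The integer program in Eq. (113) can also be written with an open-source solver. The following sketch is not the CPLEX implementation used in this work; it uses PuLP with the bundled CBC solver, and assumes the binary symplectic convention e = (e_X | e_Z) with Λ = [[0, I], [I, 0]], counting the Pauli weight with one auxiliary binary variable per qubit.

import numpy as np
import pulp

def md_decode(H_c, Lam, s):
    """Solve Eq. (113): minimize the Pauli weight of e subject to H_c Λ e^T = s over GF(2)."""
    A = (H_c @ Lam) % 2                    # binary constraint matrix
    m, n2 = A.shape
    nq = n2 // 2                           # assumed ordering e = (e_X | e_Z)
    prob = pulp.LpProblem("minimum_distance_decoding", pulp.LpMinimize)
    e = [pulp.LpVariable(f"e_{j}", cat="Binary") for j in range(n2)]
    y = [pulp.LpVariable(f"y_{q}", cat="Binary") for q in range(nq)]              # qubit q acted on?
    k = [pulp.LpVariable(f"k_{i}", lowBound=0, cat="Integer") for i in range(m)]  # GF(2) slack
    prob += pulp.lpSum(y)                  # objective: Pauli weight w(e)
    for q in range(nq):                    # y_q >= e_X,q and y_q >= e_Z,q, so y_q = OR at the optimum
        prob += y[q] >= e[q]
        prob += y[q] >= e[nq + q]
    for i in range(m):                     # H_c Λ e^T = s over GF(2), written over the integers
        prob += (pulp.lpSum(int(A[i, j]) * e[j] for j in range(n2)) - 2 * k[i] == int(s[i]))
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return np.array([int(pulp.value(v)) for v in e], dtype=int)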
Time for single prediction, implementation and environment

We measured the time for a single decoding of the [[2d^2 − 2d + 1, 1, d]] surface code with d = 11 and p = 0.
15 under the depolarizing noise for the MD decoder, the MWPM decoder, and the proposed neural decoders with the MLP and CNN models. Note that the times of the MD decoder and the MWPM decoder depend on the physical error probability.

We used IBM ILOG CPLEX via a python wrapper for constructing the MD decoder. The program was executed on an Intel Xeon E5-2687W v4 with default settings. The MD decoder takes about 330 milliseconds per decoding. Note that the time may be improved by optimizing the settings of CPLEX.

Kolmogorov's implementation of the blossom algorithm [31, 34] was used for the MWPM decoder. We compiled the code with Microsoft Visual C++ 2015 with the O2 option. The program was executed on an Intel Core i7-6700 without parallelization. The MWPM decoder took about 56 microseconds per decoding.

The proposed neural decoders were implemented with python and tensorflow. We measured the time for a single prediction when we set the batch size as 1, the number of layers as 2, and the number of units per layer as 7000 for the MLP model. The configuration of the CNN model is shown in TABLE I. The computation was performed using an Intel Core i7-6700 and a GeForce GTX 1060 6GB. The proposed neural decoders with the MLP and CNN models took 2.2 milliseconds and 7 milliseconds, respectively, for feed-forwarding the input data and finding the most probable class w. Since the prediction of the neural decoders can be done with simple matrix multiplications, we expect that the time for a single prediction of the neural decoder can be shortened by using optimized hardware such as an FPGA.

APPENDIX C: THE SPECIFIC CHOICES OF THE UNIFORM DATA CONSTRUCTION
We have introduced the uniform data construction in Sec. III. In this appendix, we show specific uniform data constructions for the surface and color codes.

We choose 3d logical operators for the [[2d^2 − 2d + 1, 1, d]] surface code by using the two patterns shown in Fig. 16. For pattern 1, each dotted line corresponds to a logical X operator, which is the product of the Pauli Z operators on the vertices on the line. For pattern 2, each dotted line corresponds to a logical Z operator, which is the product of the Pauli X operators on the vertices on the line. We choose d logical Y operators written as the product of the i-th logical X operator and the i-th logical Z operator for i = 0, ..., d − 1. We choose 3d logical operators for the [[d^2, 1, d]] surface code with the two patterns shown in Fig. 17. The rule of the choice is the same as that of the [[2d^2 − 2d + 1, 1, d]] surface code.

We choose 9(d + 1)/2 logical operators for the [6,6,6]-color code as shown in Fig. 18. There are (d + 1)/2 lines for each pattern. In all of the three patterns, each line corresponds to the logical X-, Z-, and Y-operators acting on the physical qubits on the line. We choose 6(d + 1) logical operators for the [4,8,8]-color code as shown in Fig. 19. There are (d + 1)/2 lines for each pattern. The choice of the logical operators is the same as that of the [6,6,6]-color code.

In all the patterned choices of the logical operators, we can verify that the sensitivity is constant, since every physical qubit is measured by at most a constant number of logical operators. On the other hand, the minimum boundary distance scales as O(d), since the same number O(d) of logical X-, Y-, and Z-operators are used. Thus, the normalized sensitivity scales as O(d^{−1}) with these choices.

FIG. 16: Logical operators used for the construction of a diagnosis matrix for the [[2d^2 − 2d + 1, 1, d]] surface code. Each dotted black line corresponds to a chosen logical operator. (a) Pattern 1. (b) Pattern 2.

FIG. 17: Logical operators used for the construction of a diagnosis matrix for the [[d^2, 1, d]] surface code. Each dotted black line corresponds to a chosen logical operator. (a) Pattern 1. (b) Pattern 2.

FIG. 18: Logical operators used for the construction of a diagnosis matrix for the [6,6,6]-color codes. Each colored line corresponds to chosen logical operators. The lines are colored only for visibility, and are not related to the colors of the color codes. (a) Pattern 1. (b) Pattern 2. (c) Pattern 3.
FIG. 19: Logical operators used for the construction of a diagnosis matrix for the [4,8,8]-color codes. Each colored line corresponds to chosen logical operators. The lines are colored only for visibility, and are not related to the colors of the color codes. (a) Pattern 1. (b) Pattern 2. (c) Pattern 3. (d) Pattern 4.

[1] A. Y. Kitaev, Russian Mathematical Surveys, 1191 (1997).
[2] D. Aharonov and M. Ben-Or, in Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (ACM, 1997) pp. 176-188.
[3] E. Knill, R. Laflamme, and W. H. Zurek, in Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 454 (The Royal Society, 1998) pp. 365-384.
[4] J. Kelly, R. Barends, A. Fowler, A. Megrant, E. Jeffrey, T. White, D. Sank, J. Mutus, B. Campbell, Y. Chen, et al., Nature, 66 (2015).
[5] A. Córcoles, E. Magesan, S. J. Srinivasan, A. W. Cross, M. Steffen, J. M. Gambetta, and J. M. Chow, Nature Communications (2015).
[6] D. Ristè, S. Poletto, M.-Z. Huang, A. Bruno, V. Vesterinen, O.-P. Saira, and L. DiCarlo, Nature Communications (2015).
[7] A. Y. Kitaev, Annals of Physics, 2 (2003).
[8] E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, Journal of Mathematical Physics, 4452 (2002).
[9] D. A. Lidar and T. A. Brun, Quantum error correction (Cambridge University Press, 2013).
[10] S. B. Bravyi and A. Y. Kitaev, arXiv preprint quant-ph/9811052 (1998).
[11] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, Physical Review A, 032324 (2012).
[12] C. Wang, J. Harrington, and J. Preskill, Annals of Physics, 31 (2003).
[13] D. S. Wang, A. G. Fowler, and L. C. L. Hollenberg, Physical Review A, 020302 (2011).
[14] A. G. Fowler, A. C. Whiteside, and L. C. L. Hollenberg, Phys. Rev. Lett., 180501 (2012).
[15] A. M. Stephens, Physical Review A, 022321 (2014).
[16] M.-H. Hsieh and F. Le Gall, Physical Review A, 052331 (2011).
[17] H. Bombin and M. A. Martin-Delgado, Journal of Mathematical Physics, 052105 (2007).
[18] N. Delfosse, Physical Review A, 012317 (2014).
[19] G. Duclos-Cianci and D. Poulin, Physical Review Letters, 050504 (2010).
[20] E. Magesan, J. M. Gambetta, A. Córcoles, and J. M. Chow, Physical Review Letters, 200501 (2015).
[21] G. Carleo and M. Troyer, Science, 602 (2017).
[22] J. Carrasquilla and R. G. Melko, Nature Physics, 431 (2017).
[23] J. Romero, J. P. Olson, and A. Aspuru-Guzik, Quantum Science and Technology, 045001 (2017).
[24] G. Torlai and R. G. Melko, Physical Review Letters, 030501 (2017).
[25] S. Varsamopoulos, B. Criger, and K. Bertels, Quantum Science and Technology, 015004 (2017).
[26] P. Baireuther, T. E. O'Brien, B. Tarasinski, and C. W. Beenakker, Quantum, 48 (2018).
[27] S. Krastanov and L. Jiang, Scientific Reports, 11003 (2017).
[28] N. P. Breuckmann and X. Ni, Quantum, 68 (2018).
[29] D. Gottesman, arXiv preprint quant-ph/9705052 (1997).
[30] G. Duclos-Cianci and D. Poulin, in Information Theory Workshop (ITW), 2010 IEEE (IEEE, 2010) pp. 1-5.
[31] J. Edmonds, Canadian Journal of Mathematics, 449 (1965).
[32] K. Hornik, M. Stinchcombe, and H. White, Neural Networks, 359 (1989).
[33] D. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014).
[34] V. Kolmogorov, Mathematical Programming Computation.