General framework for constructing fast and near-optimal machine-learning-based decoder of the topological stabilizer codes
Amarsanaa Davaasuren, Yasunari Suzuki, Keisuke Fujii, Masato Koashi
Department of Applied Physics, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
Photon Science Center, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
JST, PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan
Department of Physics, Graduate School of Science, Kyoto University, Kitashirakawa-Oiwakecho, Sakyo, Kyoto 606-8502, Japan
(Dated: February 19, 2019)

Quantum error correction is an essential technique for constructing a scalable quantum computer. In order to implement quantum error correction with near-term quantum devices, a fast and near-optimal decoding method is demanded. A decoder based on machine learning is considered one of the most viable solutions for this purpose, since its prediction is fast once training has been done, and it is applicable to any quantum error correcting code and any noise model. So far, various formulations of the decoding problem as a task of machine learning have been proposed. Here, we discuss general constructions of machine-learning-based decoders. We find several conditions needed to achieve near-optimal performance, and propose a criterion which should be optimized when the size of the training data set is limited. We also discuss preferable constructions of neural networks, and propose a decoder that exploits the spatial structure of topological codes using a convolutional neural network. We numerically show that our method can improve the performance of machine-learning-based decoders for various topological codes and noise models.
I. INTRODUCTION
In order to build a scalable quantum computer, quantum error correction (QEC) [1–3] is a vital technique for achieving reliable computation. According to the theory of QEC, if the noise strength is smaller than a certain threshold value, we can protect logical qubits encoded in physical qubits from the noise. Supported by extensive experimental efforts, the noise level of quantum operations on arrays of qubits is now approaching and, in some cases, meets the threshold value. Therefore, a demonstration of QEC in a fully fault-tolerant setting is considered to be a milestone for near-term quantum devices [4–6]. Topological codes [7–9] are a family of quantum error correcting codes inspired by topological phenomena in condensed matter physics [7]. Since topological codes such as surface codes [8, 10, 11] have both high experimental feasibility and high performance [12–15], they are considered the most promising candidates for quantum error correcting codes.

In QEC, information on the occurrence of physical errors is measured as a syndrome value. A suitable recovery operation is estimated from the syndrome so that the original state of the logical qubits is recovered with high success probability. Unfortunately, constructing an optimal decoder is computationally hard in general. Thus, massive efforts have been devoted to developing efficient and near-optimal decoders. One approach is to use the most likely physical error that is consistent with the observed syndrome value as a recovery operation. This scheme is called the minimum-distance (MD) decoder. Though this decoding method is not necessarily optimal, it shows almost optimal performance [13–15]. In the case of the surface codes, if we can assume that bit-flip (Pauli X) and phase-flip (Pauli Z) errors are uncorrelated, we can construct an efficient MD decoder using minimum-weight perfect matching. However, if bit-flip and phase-flip errors are correlated, or if we use other codes, even MD decoding is not efficiently implementable [16]. Some of these problems can be avoided by the use of geometrically local features of the topological codes. For example, for color codes [17], we can perform decoding by projecting the color code to a surface code [18]. Another approach is the renormalization group method [19], which is applicable to any topological code including the surface and color codes. While these approaches have been improved, there is an unavoidable trade-off between the performance and the time efficiency of the decoder. For the first experimental realization of QEC on near-term devices, more efficient and near-optimal decoders are demanded.

In this article, we discuss a general construction of machine-learning-based decoders. Recently, the technology of machine learning has been applied to various theoretical and experimental studies in quantum physics, such as classification of readout signals in experiments [20], simulation of a quantum system [21], classification of phases of matter [22], data compression of quantum states [23], and decoding in QEC [24–28]. In a machine-learning-based decoder, we construct a prediction model which outputs a recovery operator from a given syndrome value. The prediction model is trained with many pairs of syndrome values and correct recovery operations before prediction.
While the trainingtask may take a long time, it is required only once beforemany runs of prediction, and each prediction is expectedto be performed fast. Thus, the machine-learning-baseddecoder is one of the best solutions for demonstratingexperimental QEC in near-term quantum devices.As a prediction model, artificial neural network is be-lieved to have large representation power, and is suitablefor constructing machine-learning-based decoder. Re-cently, the performances of machine-learning-based de-coders with various neural networks have been numer-ically studied, such as restricted Boltzmann machine[24], multi-layer perceptron [25], recurrent neural net-work [26], and deep neural network [27]. The machine-learning-based decoder using a neural network is called neural decoders [24]. All these existing methods numeri-cally showed that the performance of the neural decoderis superior to the known efficient decoders when suffi-ciently large amount of the training data set is supplied.However, the following three points have yet to be under-stood. The first one is how the decoding problem shouldbe translated to the task of machine learning in order toobtain faster learning and better prediction. So far, eachof the previous studies introduces its own construction ofthe data set and neural network with little considerationon this point. Second, the spatial feature of the topolog-ical codes has not been considered in the construction ofthe neural decoder, except a very recent study [28] thatwas carried out independently of this work. While it isexpected that the performance of the neural decoder isimproved by explicitly considering the spatial arrange-ment of the syndrome, the spatial information has notbeen given to the neural network explicitly. Finally, theapplicability of the neural decoder to various topologicalcodes is not known. The neural decoder is benchmarkedonly with surface codes [24–28]. Therefore, it has notbeen known whether the neural decoder is applicable toother codes, such as color codes.We have addressed all of these points in this paper.First, we discuss how the decoding problem should beformulated as the task of machine learning. We pro-pose a general framework for constructing a neural de-coder, linear prediction framework , to elucidate the fac-tors that determine the performance of the decoders. Wepropose a criterion called normalized sensitivity whichshould be optimized for constructing a near-optimal neu-ral decoder. Then, we propose specific construction of atraining data set which minimizes the normalized sensi-tivity. We call these constructions as uniform data con-struction . We also propose the use of construction ofneural networks, which explicitly utilize spatial structureof the topological codes. We show that the performanceof the neural decoder is improved with these techniques,and it shows better performance than that of a decoderusing minimum-weight perfect matching with 10 dataset at distance d = 11 in the surface code under a depo- larizing noise. We show that the neural decoder is alsoapplicable to the color codes. The performance of theneural decoder for the color codes also reaches that ofthe MD decoder in small distances. Organization of the article
In Sec II, we overview preliminary topics. We reviewa scheme of QEC in the case of stabilizer codes. Weexplain specific constructions of the topological codes,the surface and color codes. We also review the basicsof the supervised machine learning with neural networksin this section. In Sec III, we address the question ofhow the neural decoder should be constructed. We pro-pose a general framework, linear prediction framework,in this section. We introduce a quantity called the nor-malized sensitivity, and argue that it serves as a criterionfor better performance of decoders for topological stabi-lizer codes. We also propose uniform data construction,which consists of specific instructions to optimize the nor-malized sensitivity for surface codes and color codes. Wenumerically confirm that the performance of the neuraldecoder is improved with this construction in the caseof the surface and color codes. In Sec IV, we propose anetwork construction which explicitly utilize the spatialinformation of the topological codes. We confirm thatthis construction also improves the performance of theneural decoder. Finally, we summarize this paper in SecV.
II. PRELIMINARY
In this section, we review the basic concepts and in-troduce notations used in this paper. We first review ascheme of QEC. We also introduce well-known topolog-ical codes and decoders. The scheme of supervised ma-chine learning with neural network and its terminologiesare also explained in this section.
A. Quantum error correction
We consider the case where k logical qubits are encoded in n physical qubits. We assume that any noise can be represented as a probabilistic Pauli operation on the n physical qubits. We denote the Pauli operators on a single qubit as {I, X, Y, Z}, and the Pauli operator A on the i-th physical qubit as A_i. When we consider operations on the n physical qubits, we ignore the global phase of the state and operator. Then, we can represent any physical error as E ∈ {I, X, Y, Z}^{⊗n}. The weight w(E) of a Pauli operator E on the n physical qubits is defined as the number of physical qubits on which E acts non-trivially.

In the framework of stabilizer codes [29], the code is defined by the 2^{n−k} stabilizer operators L_I generated by n − k Pauli operators, L_I := ⟨{S_i}⟩ (1 ≤ i ≤ n − k), where S_i ∈ ±{I, X, Y, Z}^{⊗n}, −I ∉ ⟨{S_i}⟩, and the generators commute with each other. The logical space of the code is defined as the subspace which has eigenvalue +1 for all the stabilizer operators, i.e., S_i |ψ⟩ = |ψ⟩ for all i. We denote the normalizer of the stabilizer operators as L. We call the elements of L\L_I logical operators. Each stabilizer operator acts on the logical space trivially, and each logical operator acts on the logical space non-trivially. The distance d of the code is defined as d := min_{L ∈ L\L_I} w(L). A code which encodes k logical qubits in n physical qubits with distance d is called an [[n, k, d]] code.

The occurrence of a physical error is detected from the outcomes of the stabilizer measurements s, where s^T ∈ {0,1}^{n−k} and the i-th element s_i is the measurement outcome of the i-th stabilizer operator S_i. We call s the syndrome vector. To recover the original state of the logical qubits, we estimate a recovery Pauli operator T̂(s) ∈ {I, X, Y, Z}^{⊗n} from the observed syndrome vector s so that the total operation including the physical error acts on the logical space trivially with high probability. The mapping from the syndrome s to the recovery operator T̂(s) is called a decoder T̂. The logical error probability p_L is defined as the probability with which the total operation becomes logically non-trivial. Our purpose is to construct an efficient decoder T̂ which minimizes the logical error probability p_L.

B. Binary representation of stabilizer code
It is convenient to translate the calculations in the stabilizer codes into binary calculations in GF(2). In GF(2), addition ⊕ is performed modulo 2. We relate the Pauli operators on the i-th physical qubit to another representation,
$I_i \mapsto \sigma^{(i)}_{00},\ X_i \mapsto \sigma^{(i)}_{10},\ Y_i \mapsto \sigma^{(i)}_{11},\ Z_i \mapsto \sigma^{(i)}_{01}$.  (1)
Then, a Pauli operator P on the n physical qubits can be described as
$P = \alpha \bigotimes_{i=1}^{n} \sigma^{(i)}_{v_i v_{n+i}}$,  (2)
where $\alpha \in \{\pm 1, \pm i\}$ and $v_i \in \{0,1\}$ ($1 \le i \le 2n$). We define a binary mapping
$b(P) := v$,  (3)
where $v := (v_1, v_2, \ldots, v_{2n-1}, v_{2n}) \in \{0,1\}^{2n}$ is a row vector, for the Pauli operator $P = \alpha \bigotimes_{i=1}^{n} \sigma^{(i)}_{v_i v_{n+i}}$. For two arbitrary Pauli operators P and P′, b(P) = b(P′) means that the two Pauli operators are equivalent up to a global phase. The product of two Pauli operators P and P′ is represented by the sum b(PP′) = b(P) ⊕ b(P′). With the 2n × 2n matrix
$\Lambda = \begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}$,
the commutation relation of two Pauli operators P and P′ is given by $b(P)\,\Lambda\, b(P')^T$, which is 0 if P and P′ commute, and 1 if they anti-commute. We denote this commutation relation in terms of the binary representations $v, v' \in \{0,1\}^{2n}$ as $c(v, v') := v \Lambda v'^T$. The weight of the binary representation of a Pauli operator, w(v), is defined so that w(b(P)) = w(P), which is equivalent to defining the weight as the number of indices i (1 ≤ i ≤ n) such that
$v_i \oplus v_{i+n} \oplus v_i v_{i+n} = 1$.  (4)
We use h(v) for the Hamming weight of v as a binary string, namely, the number of indices i (1 ≤ i ≤ 2n) such that v_i = 1. We denote the i-th row vector of a matrix M as (M)_i. The length of a vector v is represented as |v|. With this definition, the normalizer of the stabilizer operators L is characterized by
$b(\mathcal{L}) = \{\, v \mid v \in \{0,1\}^{2n},\ c(v, v') = 0\ \ \forall v' \in b(\mathcal{L}_I) \,\}$,  (5)
since the normalizer of the stabilizer operators is equivalent to the centralizer in the current formalism. Note that the stabilizer group can be defined with the normalizer L as
$b(\mathcal{L}_I) = \{\, v \mid v \in \{0,1\}^{2n},\ c(v, v') = 0\ \ \forall v' \in b(\mathcal{L}) \,\}$.  (6)
With this formalism, QEC is translated as follows. The physical error E can be represented as a row binary vector $e := b(E) \in \{0,1\}^{2n}$, which occurs with a certain probability p_e. The syndrome vector s is given by the column vector $s(e) := H_c \Lambda e^T$, where H_c is an (n−k) × 2n matrix of which the i-th row vector (H_c)_i is b(S_i). The matrix H_c is called the check matrix. In the binary representation, we denote a decoder as r, which maps a given syndrome vector $s^T \in \{0,1\}^{n-k}$ to a binary representation of a recovery operator $r(s) \in \{0,1\}^{2n}$. It is convenient to define the pure error t(s) [30] to represent various vectors succinctly. The pure error is a function which maps a syndrome vector $s^T \in \{0,1\}^{n-k}$ to a vector $t(s) \in \{0,1\}^{2n}$, and satisfies $t(s(e)) \oplus e \in b(\mathcal{L})$ for an arbitrary $e \in \{0,1\}^{2n}$. We also introduce a 2k × 2n generator matrix G such that the elements of b(L) are uniquely represented as follows:
$b(\mathcal{L}) = \{\, l \oplus w G \mid l \in b(\mathcal{L}_I),\ w \in \{0,1\}^{2k} \,\}$.  (7)
Note that the generator matrix G satisfies $H_c \Lambda G^T = 0$. We define the cosets $L_w$ with $w \in \{0,1\}^{2k}$ as
$L_w = \{\, l \oplus w G \mid l \in L_0 \,\}$.  (8)
Note that $L_0 = b(\mathcal{L}_I)$.
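To make the binary formalism concrete, the following sketch (our own illustration, not code from the paper) encodes Pauli operators as length-2n binary vectors and evaluates syndromes and commutation relations with NumPy. The three-qubit bit-flip code used at the end, and all function names, are assumptions chosen only for brevity.

```python
import numpy as np

def symplectic_form(n):
    """The 2n x 2n matrix Lambda used in c(v, v') = v Lambda v'^T (mod 2)."""
    zero, eye = np.zeros((n, n), dtype=np.uint8), np.eye(n, dtype=np.uint8)
    return np.block([[zero, eye], [eye, zero]])

def pauli_to_binary(pauli):
    """b(P): e.g. 'XIZ' -> (x-part | z-part) as a length-2n binary row vector, as in Eq. (1)."""
    n = len(pauli)
    v = np.zeros(2 * n, dtype=np.uint8)
    for i, op in enumerate(pauli):
        if op in "XY":
            v[i] = 1          # X component
        if op in "ZY":
            v[n + i] = 1      # Z component
    return v

def commute(v, vp, Lam):
    """c(v, v'): 0 if the operators commute, 1 if they anti-commute."""
    return int((v @ Lam @ vp) % 2)

def syndrome(H_c, Lam, e):
    """s(e) = H_c Lambda e^T over GF(2)."""
    return (H_c @ Lam @ e) % 2

# Toy example (our choice): three-qubit bit-flip code with stabilizers Z1Z2 and Z2Z3.
n = 3
Lam = symplectic_form(n)
H_c = np.array([pauli_to_binary("ZZI"), pauli_to_binary("IZZ")])  # check matrix, (n-k) x 2n
e = pauli_to_binary("XII")                                        # bit flip on the first qubit
print(syndrome(H_c, Lam, e))   # -> [1 0]: the error anti-commutes with Z1Z2 only
```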
Given t(s) and G, an arbitrary physical error $e \in \{0,1\}^{2n}$ is uniquely decomposed as
$e = l(e) \oplus w(e)\, G \oplus t(s(e))$  (9)
with $l(e) \in L_0$ and $w(e) \in \{0,1\}^{2k}$. We call w(e) the class of e.

A decoder with a recovery operation r(s) can correct an error e if and only if $e \oplus r(s(e)) \in L_0$. Under an error model {p_e}, the logical error probability is given by
$p_L = \Pr_{e \sim \{p_e\}}\big[\, e \oplus r(s(e)) \notin L_0 \,\big] = \Pr_{e \sim \{p_e\}}\big[\, r(s(e)) \oplus w(e) G \oplus t(s(e)) \notin L_0 \,\big]$.  (10)

C. Optimal and near-optimal decoders
An optimal decoder is defined as a decoder which minimizes the logical error probability. Let us write the conditional probability of $w(e) \in \{0,1\}^{2k}$ for a given syndrome vector s as
$q_s(w) := \Pr_{e \sim \{p_e\}}\big[\, w(e) = w \mid s(e) = s \,\big]$.  (11)
Since the decoder is only provided with s, and distinct recovery operators are needed for correcting errors with different values of w(e), the maximum probability of successful correction given s is $\max_{w \in \{0,1\}^{2k}} q_s(w)$. We thus say a decoder is optimal if r(s) satisfies
$\Pr_{e \sim \{p_e\}}\big[\, e \oplus r(s) \in L_0 \mid s(e) = s \,\big] = \max_{w \in \{0,1\}^{2k}} q_s(w)$  (12)
for any s with
$\Pr_{e \sim \{p_e\}}\big[\, s(e) = s \,\big] > 0$.  (13)
Though the definition of w(e) depends on the choice of t(s) and G, the optimality of a decoder r(s) is independent of this choice.

Another important class of near-optimal decoders is the minimum-distance (MD) decoder. An MD decoder chooses the most probable physical error e*(s), which satisfies
$p_{e^*(s)} \ge p_e \quad \forall e \in \{\, e \mid s(e) = s \,\}$,  (14)
as a recovery operation. Though the most likely physical error e*(s) does not necessarily satisfy the condition of Eq. (12), it is empirically known that the MD decoder achieves near-optimal performance.

It is known that the MD decoder can be constructed efficiently only in limited cases of the code and the error model. For example, we can construct an efficient MD decoder for the surface code under independent bit-flip and phase-flip errors. In this case, we can reduce the decoding problem to minimum-weight perfect matching (MWPM), which can be solved efficiently with the blossom algorithm [31]. When bit-flip and phase-flip errors are correlated, we can still construct a decoder with MWPM by ignoring the correlation, resulting in a sub-optimal decoder. We call such a decoder an MWPM decoder.
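On a toy scale, MD decoding can be done by exhaustive search: enumerate Pauli errors in order of increasing weight and return the first one consistent with the observed syndrome; for i.i.d. single-qubit noise such as depolarizing noise, a minimum-weight consistent error is also a most probable one. The sketch below is our own illustration (function names and the toy code are ours, not the authors'), and its exponential cost restricts it to very small codes.

```python
import itertools
import numpy as np

def syndrome(H_c, Lam, e):
    """s(e) = H_c Lambda e^T over GF(2); e is a length-2n binary row vector."""
    return (H_c @ Lam @ e) % 2

def brute_force_md_decoder(H_c, Lam, s, max_weight=4):
    """Return a minimum-weight Pauli error (in binary representation) consistent with s."""
    n = H_c.shape[1] // 2
    for weight in range(max_weight + 1):
        # place `weight` single-qubit Paulis (X, Z, or Y) on distinct qubits
        for qubits in itertools.combinations(range(n), weight):
            for paulis in itertools.product([(1, 0), (0, 1), (1, 1)], repeat=weight):
                e = np.zeros(2 * n, dtype=np.uint8)
                for q, (x_bit, z_bit) in zip(qubits, paulis):
                    e[q], e[n + q] = x_bit, z_bit
                if np.array_equal(syndrome(H_c, Lam, e), s):
                    return e
    return None  # no error of weight <= max_weight reproduces s

# Toy usage: three-qubit bit-flip code (stabilizers Z1Z2, Z2Z3) and the syndrome of an X error.
n = 3
Lam = np.block([[np.zeros((n, n), dtype=int), np.eye(n, dtype=int)],
                [np.eye(n, dtype=int), np.zeros((n, n), dtype=int)]])
H_c = np.array([[0, 0, 0, 1, 1, 0],
                [0, 0, 0, 0, 1, 1]])
s = np.array([1, 0])
print(brute_force_md_decoder(H_c, Lam, s))   # -> [1 0 0 0 0 0], i.e. X on the first qubit
```

D. Topological code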
We consider two types of the topological codes in thisarticle: surface codes and color codes. The qubit al-location of the surface code is shown in Fig. 1. The[[2 d − d +1 , , d ]] code and the [[ d , , d ]] code are shownin Fig. 1(a) and (b), respectively. In both figures, thephysical qubits are located on the vertices of the col-ored faces. Each red face represents a stabilizer operatorwhich is a product of Pauli X operators on the physical qubits of its vertices. Each blue face represents one withPauli Z operators.The color codes consist of the lattice which has 3-colored faces: red, green, and blue. Two types of codes,the [4,8,8]-color code and the [6,6,6]-color code, are shownin Fig. 2(a) and (b), respectively. The physical qubits arealso located on each vertex of the faces. Each coloredface represents a stabilizer operator, including nontrivialPauli operators for its vertices. The [4 , , d + d − , , d ]] code, and the [6 , , d + , , d ]] code. E. Supervised machine learning
Supervised machine learning is a branch of artificial intelligence that requires a training data set {(x_1, y_1), ..., (x_N, y_N)} consisting of feature data x_i and the corresponding label data y_i. Its aim is to prepare a model that takes feature data as input and outputs an inferred label for it. The model has a predetermined structure and trainable parameters θ.

Unlike a simple dictionary, the model is expected to infer a label even for unseen feature data. This is achieved by optimizing the model parameters θ for the training data set. This process is commonly called training. Specifically, during training, the difference between the output y′ of the model for a feature and the correct label y is evaluated with a real-valued loss function L(y, y′). The loss is minimized if and only if the prediction is exactly the same as the correct label. The training data is used to optimize the model parameters θ so as to reduce the loss. This can be done with standard optimization methods such as stochastic gradient descent:
$\theta \leftarrow \theta - \gamma \nabla_\theta L$,  (15)
where $\gamma \in \mathbb{R}$ is a learning rate and L is calculated for a randomly chosen subset, called a batch, of the training data set. As we can see here, the loss function is required to be differentiable, such as the L2 distance ||y − y′||. Once trained, we can apply the model to unseen feature data and obtain a predicted label through a simple calculation involving the network parameters and the input feature data.

An artificial neural network (ANN) is a machine learning model inspired by neural structures found in nature. Here, we assume that neurons are real-valued functions and that a layer h is a vector of neurons. The multilayer perceptron (MLP) is one of the simplest ANNs which, as its name suggests, consists of multiple layers of neurons including the input and output layers. Each neuron in a layer is connected to all neurons in the neighboring layers with trainable weights and biases, and is completely independent of the other neurons in its own layer. Mathematically, this can be described as
$h^{(n)}_i = A\Big( \sum_j W^{(n,n-1)}_{ij} h^{(n-1)}_j + b^{(n)}_i \Big)$,  (16)
where A is a nonlinear activation function, $h^{(n)}_i$ is the i-th neuron in the n-th layer, $b^{(n)}_i$ is the bias added to the i-th neuron in the n-th layer, and $W^{(n,n-1)}_{ij}$ is the weight connecting the i-th neuron in the n-th layer to the j-th neuron in the (n−1)-th layer. The gradient $\nabla_\theta L$ is evaluated with the back-propagation method. According to the universal approximation theorem [32], any continuous function can be approximated by an MLP model of finite size, though its structure is simple and compact. Thus, we expect that a neural decoder with an MLP model can achieve near-optimal performance under an appropriate training process.

FIG. 1: The qubit allocation of the surface codes with (a) the [[2d² − 2d + 1, 1, d]] code and (b) the [[d², 1, d]] code. Each vertex corresponds to a physical qubit. Red and blue faces correspond to stabilizer measurements with X and Z Pauli operators, respectively.

FIG. 2: The qubit allocation of the [4,8,8]-color code and the [6,6,6]-color code. Each vertex corresponds to a physical qubit, and each face corresponds to a stabilizer operator.
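To make the training loop of Eqs. (15)–(16) concrete, the following sketch builds a small MLP that maps a syndrome vector to a real-valued diagnosis vector and trains it with the squared L2 loss used later in Sec. III. This is an illustration we add here, not the authors' implementation; the layer sizes, learning rate, and the choice of PyTorch are our own assumptions (the paper's hyper-parameters are chosen by grid search, see its Appendix B).

```python
import torch
from torch import nn

# Hypothetical sizes: n_syndrome syndrome bits in, L_g diagnosis bits out.
n_syndrome, L_g, n_hidden = 24, 15, 256

# A plain MLP as in Eq. (16): affine layers followed by nonlinear activations.
model = nn.Sequential(
    nn.Linear(n_syndrome, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, L_g), nn.Sigmoid(),   # outputs in [0, 1]
)
loss_fn = nn.MSELoss()                        # squared L2 distance
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(syndromes, diagnoses):
    """One SGD update (Eq. (15)) on a batch of (s, g) training pairs."""
    optimizer.zero_grad()
    loss = loss_fn(model(syndromes), diagnoses)
    loss.backward()                           # back-propagation of the gradient
    optimizer.step()
    return loss.item()

# Usage with random placeholder data (real data comes from the error model of Sec. III B):
s_batch = torch.randint(0, 2, (128, n_syndrome)).float()
g_batch = torch.randint(0, 2, (128, L_g)).float()
print(train_step(s_batch, g_batch))
```

III. CONSTRUCTION OF TASKS OF MACHINE-LEARNING-BASED DECODERS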
In general, the achievable accuracy in machine learning with a given size of training data depends on the formulation of the prediction task. In order to construct a near-optimal neural decoder, it is vital to consider what is a preferable formulation of the prediction task. However, this point has not been discussed in a unified view in the existing methods [24–27]. In this section, we discuss how the decoding problem should be formulated as a task of machine learning in order to achieve near-optimal performance. To this end, we propose a general framework, which we call the linear prediction framework. In this framework, we can analytically study the behavior of the neural decoder, and can discuss requirements for achieving near-optimal performance. Based on the discussion, we propose a criterion, the normalized sensitivity, which should be optimized in defining the label for constructing a good decoder. We show specific constructions which minimize the normalized sensitivity for the surface codes and the color codes, which we call the uniform data construction. Then, we numerically confirm that the performance of the neural decoder is improved with this construction. We also confirm that this construction is applicable to the color codes.

FIG. 3: Feed-forward network. The j-th neuron in the (n−1)-th layer is connected to the i-th neuron in the n-th layer via the weight W_ij.

A. Linear prediction framework
In order to discuss the behavior of the neural decoderin a unified view, we consider a neural decoder with thefollowing two specifications. First, the neural decoderuses the syndrome vector s as the feature data to befed to the trainable model. Second, the label data is abinary vector, and the correct label is linearly generatedfrom the physical error vector e in GF(2). We call alinearly generated label vector g as a diagnosis , and amatrix H g which generates the diagnosis g := H g Λ e T asa diagnosis matrix. We denote the length of the diagnosisvector g as L g . The recovery operator r is calculatedfrom the predicted diagnosis g and the syndrome s . Weuse an assumed physical error distribution { p e } only forgenerating a training data set { ( s i , g i ) } , and do not use it for constructing H g or in the calculation of the recoveryoperator r from g and s . Though this framework restrictsthe label to be linearly generated from the physical error,this is general enough to formulate all the constructionsdescribed in the existing methods as special cases [24–28]with small technical exceptions.Since the actual performance of the neural decoder de-pends on many factors such as configurations of the train-ing process, the size of the training data set, and detailsof the network construction, we start with consideringthe problem under an ideal limit. We first consider theproblem under the simple 0-1 loss function with an unlim-ited size of the training data set. Then, we relax theseimpractical assumptions to practical ones. Though wenumerically investigate the case of a single logical qubit( k = 1) later, we present the formalism for a generalvalue of k .
1. The neural decoder with the 0-1 loss function and anunlimited training data set
We first consider a hypothetical decoder that can min-imize any loss function with an unlimited number of thetraining data set. Though such an assumption is notpractical, it is convenient to reveal the conditions for per-forming optimal decoding with machine learning in theideal limit. We choose the 0-1 delta function δ ( g , g (cid:48) ) asthe loss function, which is zero if the predicted and thecorrect diagnosis are the same, and unity otherwise. Letus consider the portion of training data set with a specificvalue of s with Pr e ∼{ p e } [ s ( e ) = s ] >
0. If the neural de-coder returns diagnosis g for the input s , the total lossfor this portion is proportional to the following value, L ( δ ) s ( g ) := E e ∼{ p e } (cid:2) δ ( g , H g Λ e T ) (cid:12)(cid:12) s ( e ) = s (cid:3) = 1 − Pr e ∼{ p e } (cid:2) H g Λ e T = g (cid:12)(cid:12) s ( e ) = s (cid:3) . (17)Let g ( δ ) ( s ) be the output of the ideally trained neuraldecoder. Since it should minimize the total loss for every s , it satisfies L ( δ ) s ( g ( δ ) ( s )) = min g L ( δ ) s ( g ) . (18)We call this ideal decoder a delta diagnosis decoder and g ( δ ) ( s ) a delta diagnosis vector .We show the condition for a diagnosis matrix H g toguarantee that we can perform the optimal decoding withthe delta diagnosis decoder. To this end, we define aproperty of the diagnosis matrix and introduce a set ofdiagnosis vectors as follows. Definition III.1. faithful diagnosis matrix — Given acheck matrix H c , we say diagnosis matrix H g is faithful if span( { ( H cg ) i } ) = b ( L ) , (19)or equivalently, H cg Λ e T = 0 ↔ e ∈ L , (20)where H cg := (cid:18) H c H g (cid:19) . (21) Definition III.2. faithful diagnosis vectors — Given acheck matrix H c , a pure error t ( s ), and a faithful diag-nosis matrix H g , we define 2 k faithful diagnosis vectors { g s ( w ) } ( w ∈ { , } k ) associated with a syndrome vec-tor s by g s ( w ) := H g Λ( w G ⊕ t ( s )) T . (22)Note that the faithful condition of H g implies that w (cid:55)→ g s ( w ) (23)is injective and H g Λ e T = g s ( w ( e )) , (24)with s = H c Λ e T . As a result, when H g is faithful, wehave1 − L ( δ ) s ( g ) = Pr e ∼{ p e } [ g s ( w ( e )) = g | s ( e ) = s ] (25)from Eqs. (17) and (24). Then the injective property of g s ( w ) leads to 1 − L ( δ ) s ( g s ( w )) = q s ( w ) , (26)where q s ( w ) is defined in Eq. (11).When the diagnosis matrix is faithful, we can constructan optimal decoder as follows. From Eqs. (18) and (26),we see that the delta diagnosis vector g ( δ ) ( s ) is one ofthe faithful diagnosis vectors. We can thus write it inthe form g ( δ ) ( s ) = g s ( w ∗ ( s )) . (27)Eqs. (18), (26), and (27) imply that1 − q s ( w ∗ ( s )) = L ( δ ) s ( g ( δ ) ( s ))= min w ∈{ , } k (1 − q s ( w ))= 1 − max w ∈{ , } k q s ( w ) . (28)Since g s ( w ) is injective, one can calculate w ∗ ( s ) from thediagnosis g ( δ ) ( s ) and syndrome s . The recovery operatoris then chosen as r ( s ) = w ∗ ( s ) G ⊕ t ( s ) . (29)For the optimality, we havePr e ∼{ p e } [ e ⊕ r ( s ) ∈ L | s ( e ) = s ] = q s ( w ∗ ( s ))= max q s ( w ) (30)for any s with Pr e ∼{ p e } [ s ( e ) = s ] >
0, which satisfiesEq. (12).We can also prove a converse statement for the caseswhere H g is not faithful (see Appendix A), arriving atthe following lemma. Lemma III.1.
If the diagnosis matrix H g is faithful,there exists a map r ∗ ( g , s ) such that the decoder with r ( s ) = r ∗ ( g ( δ ) ( s ) , s ) is optimal for arbitrary distribution { p e } . If the diagnosis matrix H g is not faithful, no suchmap exists.This lemma implies that we can perform optimal de-coding with the delta diagnosis decoder only when thediagnosis matrix H g is faithful. Note that the set of thefaithful vectors { g s ( w ) | w ∈ { , } k } is independent ofthe choice of the generator G and the pure error t ( s ).Whether we can perform the optimal decoding or not isdependent only on the construction of H g .
2. The neural decoder with the L2 loss function and anunlimited training data set
In this subsection, we replace the 0-1 loss function witha more practical one, which is the squared L2 distance.We still consider the limit of an infinite size of the trainingdata set and the perfect loss minimization. In this case,the total loss for a fixed s under an unlimited trainingdata set is proportional to the following value. L (L2) s ( g ) = E e ∼{ p e } (cid:2) || g − H g Λ e T || (cid:12)(cid:12) s ( e ) = s (cid:3) (31)We define a decoder which is ideally trained with the L2loss function as an L2 diagnosis decoder . We also callthe output of the L2 diagnosis decoder as an
L2 diagno-sis vector g (L2) ( s ). The L2 diagnosis vector satisfies thefollowing equation. L ( L s ( g (L2) ( s )) = min g ∈{ , } Lg L (L2) s ( g ) . (32)When the chosen diagnosis matrix is faithful, we can an-alytically solve g (L2) ( s ) by differentiating Eq. (31), andthe L2 diagnosis vector can be written as follows. g (L2) ( s ) := (cid:88) w ∈{ , } k q s ( w ) g s ( w ) (33)Let us define a column vector of order 2 k as q s := ( q s (0 k ) , . . . q s (1 k )) T . (34)It satisfies the following matrix equation: (cid:18) ˆ g (L2) ( s )1 (cid:19) = D s q s , (35)where D s = (cid:18) g s (0 k ) · · · g s (1 k )1 · · · (cid:19) . (36)We can solve it for q s if D s has a left inverse D − s suchthat D − s D s = I in the real-valued calculation, namely, ifthe rank of D s as a real-valued matrix is 2 k . If the rankis smaller, solution q s is not unique, and hence it is notalways possible to determine w that maximizes q s ( w ),which implies we cannot perform the optimal decoding.Though the rank condition depends apparently on thesyndrome s , we can formulate it as a condition which isindependent of s . Any faithful diagnosis g s ( w ) can bewritten as g s ( w ) = H g Λ( w G ) T ⊕ δ ( s ) (37)with δ ( s ) := H g Λ t ( s ) T ∈ (cid:0) { , } L g (cid:1) T . (38) We define a transformation σ δ by( σ δ ( v )) i := δ i + ( − δ i v i (39)for δ ∈ { , } k and v ∈ R k . It is affine, isometric, andinvolutory. Since g s ( w ) = σ δ ( s ) ( H g Λ( w G ) T ), we have D s = (cid:18) σ δ ( s ) ( H g Λ((0 k ) G ) T ) · · · σ δ ( s ) ( H g Λ((1 k ) G ) T )1 · · · (cid:19) . (40)We see that a transformation σ δ is an affine transforma-tion, and this transformation satisfies σ δ ( σ δ ( v )) = v (41) σ ( v ) = v . (42)Thus, when we apply the transformation σ δ ( s ) to Eq. (35), we obtain (cid:18) σ δ ( g (L2) ( s ))1 (cid:19) = D q s , (43)where D := (cid:18) H g Λ((0 , . . . , G ) T · · · H g Λ((1 , . . . , G ) T · · · (cid:19) . (44)Thus, we can uniquely calculate q s for an arbitrary s if a matrix D has a left inverse, which is equivalent tothe condition that { H g Λ( w G ) T | w ∈ { , } k } is affinelyindependent. We will call a diagnosis matrix satisfyingthis condition to be decomposable: Definition III.3. decomposable diagnosis matrix —Given a generator matrix G , we say a diagnosismatrix H g is decomposable if a set of real vectors { H g Λ( w G ) T | w ∈ { , } k } is affinely independent,namely, the rank of a matrix D defined in Eq. (44) is2 k when we consider D as a real-valued matrix.When H g is faithful, the above definition is indepen-dent of G , because the set { H g Λ( w G ) T | w ∈ { , } k } isindependent of G then.We show a scheme to perform the optimal decodingusing L2 diagnosis decoder when a diagnosis matrix isfaithful and decomposable. When H g is decomposable,there exists a left inverse D − such that D − D = I inreal vector space. When we observe a syndrome vector s ,we obtain the L2 diagnosis g (L2) ( s ) using the trained L2diagnosis decoder, and calculate δ ( s ) = H g Λ t ( s ). Sincethe diagnosis matrix is faithful, the probabilities of thefaithful diagnosis vectors are given by q s = D − (cid:18) σ δ ( s ) ( g (L2) ( s ))1 (cid:19) . (45) Then, we construct a recovery operator as r ( s ) = w ∗ ( s ) G ⊕ t ( s ) , (46)where w ∗ ( s ) satisfies q s ( w ∗ ( s )) = max w q s ( w ) . (47)With this recovery operator, we obtainPr e ∼{ p e } [ e ⊕ r ( s ) ∈ L | s ( e ) = s ] = q s ( w ∗ ( s )) , (48)and thus this decoder satisfies Eq. 
(12).When the diagnosis matrix H g is faithful, we can alsoprove a converse statement for the cases where a faithfuldiagnosis matrix H g is not decomposable (see AppendixA), arriving at the following lemma. Lemma III.2.
If the diagnosis matrix H g is faithful anddecomposable, there exists a map r ∗ ( g , s ) such that thedecoder with r ( s ) = r ∗ ( g (L2) ( s ) , s ) is optimal for arbi-trary distribution { p e } . If the diagnosis matrix H g isfaithful but not decomposable, no such map exists.We show a simple example of a faithful and decompos-able matrix H g in the case of k = 1. We choose vectors l , l , and l from L , L , and L , respectively. Weconstruct H g and generator G as H g = l l l , (49) G = (cid:18) l l (cid:19) . (50)We see that span( { ( H g ) i } ) = b ( L ), and thus H g is faith-ful. A set { H g Λ( w G ) T | w ∈ { , , , }} is { (0 , , T , (0 , , T , (1 , , T , (1 , , T } , (51)which is affinely independent, and thus H g is decompos-able. We can verify the same by checking the rank of D = (cid:18) g (00) g (01) g (10) g (11)1 1 1 1 (cid:19) = (52)to be 4 in real vector space. We can show that therealways exists such a faithful and decomposable diagnosismatrix for all k and H c . See Appendix A for the proof.
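As a sanity check on a candidate construction, the conditions of Definitions III.1 and III.3 can be tested numerically. The sketch below is our own illustration (helper names are ours): it checks faithfulness by comparing GF(2) row spaces of [H_c; H_g] and [H_c; G], and decomposability by the real rank of the matrix D of Eq. (44). It assumes H_c of shape (n−k) × 2n, H_g of shape L_g × 2n, G of shape 2k × 2n, and Lam the symplectic form Λ.

```python
import numpy as np
from itertools import product

def gf2_rank(M):
    """Rank of a binary matrix over GF(2) by Gaussian elimination."""
    M = (M % 2).astype(np.uint8).copy()
    rank, rows, cols = 0, M.shape[0], M.shape[1]
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, c]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]
        rank += 1
    return rank

def is_faithful(H_c, H_g, G):
    """Rows of [H_c; H_g] span b(L) iff they span the same GF(2) row space as [H_c; G]."""
    A, B = np.vstack([H_c, H_g]), np.vstack([H_c, G])
    return gf2_rank(A) == gf2_rank(B) == gf2_rank(np.vstack([A, B]))

def is_decomposable(H_g, G, Lam, k):
    """Real rank of the matrix D of Eq. (44)/(52) must be 2^(2k)."""
    cols = []
    for w in product([0, 1], repeat=2 * k):
        wG = (np.array(w) @ G) % 2           # binary representation of the class-w logical
        g_w = (H_g @ Lam @ wG) % 2           # faithful diagnosis H_g Lambda (wG)^T
        cols.append(np.append(g_w, 1))
    D = np.array(cols, dtype=float).T
    return np.linalg.matrix_rank(D) == 2 ** (2 * k)
```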
3. The neural decoder with the L2 loss function under afinite training data size
In practical cases, the size of the training data set islimited, and hence the loss is not perfectly minimized.This implies that the output diagnosis from the modeldeviates from the L2 diagnosis vector. In such a case, itis desirable to construct a decoder such that its predic-tion is as robust against the deviations as possible. Weintroduce a slight modification to the optimal decodingscheme in the last subsection, so that it should applica-ble to an output diagnosis deviated from the L2 diagnosisvector.We denote the predicted diagnosis as g P ( s ) ∈ R L g ,which deviates from the L2 diagnosis vector. Note that g P ( s ) cannot be represented as a linear combination ofthe faithful diagnosis vectors in general. In order toconstruct a decoding scheme which is robust to a smalldeviation, it is natural to extend the scheme employedin Sec. III A 2 such that we project g P ( s ) to the hyper-plane formed by affine combinations of the faithful diag-nosis vectors, and then extract the coefficients q P s fromthe projected point. This projection and extraction isachieved as follows. We perform QR decomposition for D , and obtain D = QR , where Q is an orthogonal ma-trix, and R is an upper-triangular matrix. We construct D − = R − Q T , which satisfies D − D = I . Then, weobtain a predicted vector q P s as q P s = D − (cid:18) σ δ ( s ) ( g P ( s ))1 (cid:19) , (53)where δ ( s ) = H g Λ t ( s ). We construct a recovery opera-tor as r ( s ) = w ∗ ( s ) G ⊕ t ( s ) , (54) where w ∗ ( s ) satisfies q P s ( w ∗ ( s )) = max w q P s ( w ) . (55)Note that though elements of q P s may be out of [0 ,
4. Criterion for diagnosis matrix
In practice, the number of the training data set is farsmaller than the total variation of syndrome vectors s when distance d is larger than about 7. For example,according to the existing methods [24–27], the size of thetraining data set is at most 10 . On the other hand, thenumber of variations in the syndrome, 2 n − k , becomeslarger than 10 at the distance d = 7 for the [[ d , , d ]]surface code. This implies that almost all the patterns ofthe syndrome vector s given in experiments are not foundin the training data set. The model should infer the L2diagnosis vector g (L2) ( s ) of s where s is not included inthe training data set. The aim of this subsection is topropose a criterion for H g which we believe to reflect therobustness of the prediction when we use such a sparselysampled training data set.Since the problem is to estimate the vector-valuedfunction g (L2) ( s ) from a sparsely sampled set of values,its difficulty should depend on how rapidly the functionchanges its output value as the input value s varies. FromEqs. (24) and (33), we see that the function is written as g (L2) ( s ) = E e ∼{ p e } [ H g Λ e T | s ( e ) = s ] , (56)which shows that g (L2) ( s ) is implicitly determined fromthe two functions of errors, g ( e ) = H g Λ e T and s ( e ) = H c Λ e T . In order to quantify how rapidly these functionchange, let us introduce a sensitivity m ( H ) of a binarymatrix H as m ( H ) := max e , e (cid:48) ∈{ , } n h ( e ⊕ e (cid:48) )=1 || H Λ e T − H Λ e (cid:48) T || = max e ∈{ , } n h ( e )=1 h ( H Λ e T ) . (57)Using the sensitivity, the variation of s ( e ) is boundedas || s ( e ) − s ( e (cid:48) ) || ≤ m ( H c ) h ( e ⊕ e (cid:48) ) . (58)In the case of topological codes, m ( H c ) is a small con-stant. This is because each physical qubit is monitoredby at most constant number of the stabilizer operators.Suppose that g (L2) ( s ) is close to one of the faithfuldiagnosis g s ( w ∗ ), and let S ( s , w ∗ ; 0) be the set of errors e satisfying w ( e ) = w ∗ and s ( e ) = s . We further definea set0 S ( s , w ∗ ; h ) := { e |∃ e (cid:48) s.t. e (cid:48) ∈ S ( s , w ∗ ; 0) , h ( e ⊕ e (cid:48) ) ≤ h } (59)We see that any e ∈ S ( s , w ∗ ; h ) produces a training data( s (cid:48) , g (cid:48) ) such that || s (cid:48) − s || ≤ m ( H c ) h (60) || g (cid:48) − g s ( w ∗ ) || ≤ m ( H g ) h. (61)The choice of H g also affects how precisely g (L2) ( s )should be estimated in order to determine w ∗ correctly. To quantify this, we consider how far g P ( s ) can be devi-ated from a faithful diagnosis g s ( w ) without affecting thedecoding method of Eqs. (53) and (54). When the decod-ing result changes from w ∗ = w to w ∗ = w (cid:48) , the solutionof Eq. (53) should satisfy q P s ( w ) = q P s ( w (cid:48) ), namely, g P ( s )should be written in the form g P ( s ) = α ( g s ( w ) + g s ( w (cid:48) )) + (cid:88) w (cid:48)(cid:48) (cid:54) = w , w (cid:48) β w (cid:48)(cid:48) g s ( w (cid:48)(cid:48) ) . (62)We define the minimum boundary distance M ( H g ) so as to assure that w ∗ = w as long as || g P ( s ) − g s ( w ) || ≤ M ( H g ). Hence M ( H g ) can be explicitly defined as M ( H g ) := min w , w (cid:48) ,α, { β w (cid:48)(cid:48) } || (1 − α ) g ( w ) − α g ( w (cid:48) ) − (cid:88) w (cid:48)(cid:48) (cid:54) = w , w (cid:48) β w (cid:48)(cid:48) g ( w (cid:48)(cid:48) ) || . (63)Note that the above definition is independent of s , sincethe affine transformation σ δ ( s ) is isometric. 
M(H_g) is nonzero if and only if H_g is decomposable. Regarding M(H_g) as the relevant length scale, we define the following quantity to be used as a criterion for a better construction of H_g.

Definition III.4. Normalized sensitivity — We define the normalized sensitivity N(H_g) of a faithful and decomposable matrix H_g as
$N(H_g) := \frac{m(H_g)}{M(H_g)}$,  (64)
where m(H_g) is the sensitivity of H_g defined in Eq. (57), and M(H_g) is the minimum boundary distance of H_g defined in Eq. (63).

Eqs. (61) and (63) imply that an error belonging to S(s, w*; h) with $h \lesssim (m(H_g)/M(H_g))^{-1}$ leads to a training datum useful for the estimation of g^{(L2)}(s). We thus expect that the use of a diagnosis matrix H_g with a small normalized sensitivity N(H_g) enables high-performance prediction with a small training data set.
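Both quantities in Definition III.4 can be evaluated numerically for small k. For a weight-one binary error e, H Λ e^T is a single column of H Λ, so m(H) is simply the largest Hamming weight among the columns of H Λ. The sketch below (our own illustration, not the authors' code) computes m(H_g) this way and estimates M(H_g) as the distance from a faithful diagnosis to the affine hull of the competing ones, which is how we read Eq. (63) once the affine constraint implicit in Eq. (62) is made explicit; the brute force over all class pairs is feasible only for small k.

```python
import numpy as np
from itertools import permutations, product

def sensitivity(H, Lam):
    """m(H) of Eq. (57): largest Hamming weight among the columns of H Lambda (mod 2)."""
    HL = (H @ Lam) % 2
    return int(HL.sum(axis=0).max())

def point_to_affine_hull(x, points):
    """Euclidean distance from x to the affine hull of the given points."""
    p0, rest = points[0], points[1:]
    if not rest:
        return float(np.linalg.norm(x - p0))
    A = np.stack([p - p0 for p in rest], axis=1)
    y, *_ = np.linalg.lstsq(A, x - p0, rcond=None)
    return float(np.linalg.norm(x - p0 - A @ y))

def boundary_distance(H_g, G, Lam, k):
    """M(H_g) of Eq. (63), by brute force over ordered pairs of classes (small k only)."""
    ws = list(product([0, 1], repeat=2 * k))
    g = {w: ((H_g @ Lam @ ((np.array(w) @ G) % 2)) % 2).astype(float) for w in ws}
    best = np.inf
    for w, wp in permutations(ws, 2):
        midpoint = (g[w] + g[wp]) / 2.0
        others = [g[wpp] for wpp in ws if wpp not in (w, wp)]
        best = min(best, point_to_affine_hull(g[w], [midpoint] + others))
    return best

# Normalized sensitivity of Definition III.4:
# N = sensitivity(H_g, Lam) / boundary_distance(H_g, G, Lam, k)
```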
5. Uniform data construction
We propose specific constructions which minimize thenormalized sensitivity up to the order of d in the case of k = 1. We first consider a lower-bound of the normal-ized sensitivity. When a diagnosis matrix H g is faithful,each row vector of H g corresponds to a logical operatoror a stabilizer operator. We denote the number of thelogical operators in the rows of H g as n L . The minimumboundary distance M ( H g ) is upper-bounded by M ( H g ) ≤ n L d of one-elementsin its binary representation, there are at least dn L ofone-elements in the diagnosis matrix. By denoting thenumber of the one-elements in the diagnosis matrix H g as χ ( H g ), we have dn L ≤ χ ( H g ) . (66)Since there are 2 n columns in H g , we also have χ ( H g ) ≤ n max i h (( H T g ) i ) . (67)The sensitivity m ( H g ) is equal to the maximum hammingweight of the column vectors of the diagnosis matrix,namely, max i h (( H g ) T i ) = m ( H g ) . (68)From Eqs. (65) - (68), we obtain N ( H g ) ≥ dn (69)1In particular, when we focus on the two-dimensionaltopological codes such that n = Θ( d ) , the order ofthe normalized sensitivity is lower-bounded as N ( H g ) = Ω( d − ) . (70)For surface codes and color codes with the single logicalqubit, we found specific constructions of H g such that N ( H g ) scales as Θ( d − ). See appendix C for the specificconstructions. We named these constructions as uniformdata construction of the data set, since logical operatorscorresponding to the rows of H g are chosen uniformly tocover all the physical qubits. B. Construction of data set and example
Let us summarize the discussion in Sec III A. Given thecheck matrix H c of the code and the error model { p e } ,the whole protocol can be described as follows. • Preparation:
We construct a faithful and decomposable diagnosis matrix H_g with a small normalized sensitivity, possibly N(H_g) = Θ(d/n). We choose a pure error t(s) and a generator matrix G. We perform a QR decomposition of the matrix
$D := \begin{pmatrix} g(0\cdots0) & \cdots & g(1\cdots1) \\ 1 & \cdots & 1 \end{pmatrix}$,  (71)
where $g(w) = H_g \Lambda (wG)^T$, and obtain Q and R. We calculate the left inverse matrix D⁻ as
$D^- := R^{-1} Q^T$.  (72)

• Data generation: We generate a set of physical errors {e_1, e_2, ...} with the probability distribution {p_e}, and generate the data set {(s_1, g_1), (s_2, g_2), ...} from it, where $s_i := H_c \Lambda e_i^T$ and $g_i := H_g \Lambda e_i^T$.

• Training:
The model is trained so that it canpredict g from s . The loss of the prediction is de-fined as the L2 distance between g and g P , where g P is a real-valued output vector of the model. • Prediction:
When an observed syndrome s is given to the trained model, it predicts g^P(s). In parallel, we calculate δ(s) given by
$\delta(s) = H_g \Lambda\, t(s)^T$.  (73)
We calculate the vector q_s defined in Eq. (34) as
$q^P_s = D^- \begin{pmatrix} \sigma_{\delta(s)}(g^P(s)) \\ 1 \end{pmatrix}$,  (74)
where $\sigma_{\delta(s)}$ is the affine transformation such that
$(\sigma_{\delta(s)}(v))_i = \delta_i + (-1)^{\delta_i} v_i$.  (75)
We choose w^P that satisfies
$q^P_s(w^P) = \max_{w \in \{0,1\}^{2k}} q^P_s(w)$,  (76)
where {q^P_s(w)} are the elements of q^P_s. Then, we obtain an estimated recovery operator
$r(s) = w^P G \oplus t(s)$.  (77)

We emphasize that the choice of t(s) and G does not affect the performance of the decoder, since the success of the estimation is independent of them. Only the construction of H_g affects the performance of the decoder.

We show a specific example of the decoding scheme. For simplicity, we consider the case where there are only bit-flip errors in the [[2d² − 2d + 1, 1, d]] surface code. In this case, it is enough for QEC to consider the stabilizer operators with Pauli Z operators. A simplified picture of the code is shown in Fig. 4. In this picture, a bit-flip error on a physical qubit is represented by the color of the corresponding edge (green: no error, red: error), and the syndrome value is represented by the color of the circle (green: undetected, red: detected). As shown in Fig. 4(a), the matrix H_g is constructed from logical operators, each of which is the product of the Pauli Z operators on the edges crossing one of the dotted lines. In this case, m(H_g) = O(1) while M(H_g) grows with the distance, so the normalized sensitivity m(H_g)/M(H_g) decreases polynomially in d.

Suppose that bit-flip errors occur on a set of the physical qubits as shown in Fig. 4(b). The physical error is detected through the syndrome values shown in the same figure. The diagnosis vector is calculated as the commutation relation between the chosen logical operators and the physical error. We show the calculated diagnosis on the right side of the lattice. In the training phase, the model learns the relation between the positions of the red circles and the values of the diagnosis vector. In the prediction phase, only the positions of the red circles are given. The trained neural network outputs a real-valued prediction of the diagnosis vector as shown in Fig. 4(c), for example. From this information, we extract the probabilities of the faithful diagnoses, and we choose the faithful diagnosis which is expected to be the most probable, as shown in Fig. 4(d). Since the chosen diagnosis vector is equivalent to the diagnosis vector generated by the actual physical error, this decoding trial is a success.
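Putting the preparation and prediction steps together, the following sketch (our own illustration; names such as pure_error and decode are ours, and pure_error stands for any routine returning a t(s) with H_c Λ t(s)^T = s) turns a real-valued network output g^P(s) into a recovery operator via Eqs. (71)–(77). The columns of D are assumed to be ordered in the same way as itertools.product enumerates the classes w.

```python
import numpy as np
from itertools import product

def left_inverse(D):
    """D^- = R^{-1} Q^T from a reduced QR decomposition, as in Eqs. (71)-(72)."""
    Q, R = np.linalg.qr(D)            # D is (L_g + 1) x 2^(2k) with full column rank
    return np.linalg.solve(R, Q.T)

def decode(g_pred, s, H_g, G, Lam, D_inv, pure_error, k=1):
    """Recovery operator of Eqs. (73)-(77) from a real-valued predicted diagnosis."""
    t = pure_error(s)                               # any t(s) with H_c Lambda t^T = s
    delta = (H_g @ Lam @ t) % 2                     # Eq. (73)
    sigma = delta + ((-1.0) ** delta) * g_pred      # Eq. (75), applied elementwise
    q = D_inv @ np.append(sigma, 1.0)               # Eq. (74)
    classes = list(product([0, 1], repeat=2 * k))
    w_best = np.array(classes[int(np.argmax(q))])   # Eq. (76)
    return (w_best @ G + t) % 2                     # Eq. (77)
```

C. Relation to the existing methods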
In this subsection, we explain how the existing meth-ods [24–27] can be treated in the linear prediction frame-work. The method proposed by Varsamopoulos et al. [25] used an approach similar to the example shown inSec. III A 2 in the case of k = 1. In this method, a linearmap is used for the pure error, which is called a sim-ple decoder. The pure error is then written in the form2 (a) (b) (c) (d) FIG. 4: The figures show the decoding process based on proposed scheme. Each picture shows only Z lattice, ofwhich the edge corresponds to whether there is a bit-flip error on the physical qubit or not, and the circle showswhether an error is detected through the syndrome measurement. (a) Five logical Z operators which minimize thenormalized sensitivity m ( H g ) M ( H g ) . (b) The actual physical error is drawn as red edges, and the detected syndromes asred circles. The binary numbers shown to the right is the diagnosis vector of the physical errors. The neural networklearns the relation between the location of the detected syndromes and the diagnosis vector. (c) The real-valueddiagnosis vector is predicted by the neural decoder. (d) With the syndrome pattern, faithful diagnosis vector iseither 10000 or 01111. The chosen faithful diagnosis vector is 10000. Accordingly, we choose the recovery operatorshown in the figure. In this case, the decoding succeeds. t ( s ) T = T s , where T is a 2 n × ( n − k ) matrix satisfying H c Λ T = I . The label vector used in this method canessentially be regarded as being generated by a diagnosismatrix defined by H g = l l l ( I ⊕ Λ T H c ) . (78)We see this is faithful and decomposable constructions.Let a generator matrix G be G = (cid:18) l l (cid:19) . (79)Then, a diagnosis generated from the diagnosis matrix is g = H g Λ e = w ( e ) w ( e ) w ( e ) ⊕ w ( e ) , (80)where w ( e ) = ( w ( e ) , w ( e ) ). The method in Ref. [25]uses a different set of label vectors g (cid:48) called one-hot rep-resentation, which has a one-to-one correspondence with g as g = (0 , , T (cid:55)→ g (cid:48) = (1 , , , T (81) g = (0 , , T (cid:55)→ g (cid:48) = (0 , , , T (82) g = (1 , , T (cid:55)→ g (cid:48) = (0 , , , T (83) g = (1 , , T (cid:55)→ g (cid:48) = (0 , , , T . (84)The above relation as real vectors can be written as g (cid:48) = 12 − − − − − − (cid:18) g (cid:19) + . (85)Since it is an isometric affine transformation, we expectthat this transformation has little effect on the perfor-mance of the supervised machine learning. The matrix H g is faithful and decomposable, but its normalized sensi-tivity is O (1). We thus expect that this decoder becomesnear-optimal when the training is ideally performed, butthe prediction is not robust when the size of the trainingdata set is small.The method proposed by Baireuther et al. [26] mainlyfocuses on a model applicable to quantum error correc-tion when we perform various counts of repetitive stabi-lizer measurements by utilizing recurrent neural network.They use the commutation relation between the physicalerror and a logical Z operator as the label, since theyonly concerned about the logical bit-flip probability withthe fixed initial state in the logical space. We can thusconsider this method as a case of the linear predictionframework.Torlai et al. [24], Krastanov et al. [27], and Breuck-mann et al. [28] took a different approach from the abovetwo [25, 26]. They used the binary representation of thephysical error as the label vector. 
In the linear predictionframework, it corresponds to a choice of H g = Λ leadingto g = H g Λ e T = e T (86)Since H g is not faithful, it cannot constitute an optimaldecoder even with the delta diagnosis decoder. Interest-ingly, the delta diagnosis decoder with this choice of H g works as an MD decoder, which can be shown by thefollowing lemma. Lemma III.3.
If the matrix H cg has rank 2 n in GF(2),there exists a map r ∗ ( g , s ) such that the decoder with r ( s ) = r ∗ ( g ( δ ) ( s ) , s ) works as an MD decoder for arbi-trary distribution { p e } . If H cg does not have rank 2 n ,no such map exists. Proof. If H cg has rank 2 n , there exists a left inverse bi-nary matrix H − cg such that H − cg H cg = I . Then, we can3obtain the physical error e asΛ H − cg (cid:18) sg (cid:19) = e T . (87)Thus, we can obtain the most probable physical error e ∗ ( s ) from the most probable diagnosis.If H cg does not have rank 2 n in GF(2), there exist twophysical errors which generate the same pair of syndromeand diagnosis. We cannot determine which is more prob-able. Thus, we cannot perform MD decoding when H cg does not have rank 2 n .A drawback in this approach is difficulty arising whenwe replace a loss function with a practical one such asL2 distance. In order to satisfy a decomposable propertyin MD decoding, the length of the diagnosis must be noshorter than 2 n + k since there are 2 n + k possible candi-dates of the most probable physical error. This is notpractical when the distance is large, and thus it requiresheuristics such as repetitive sampling. D. Numerical result
We numerically show that the uniform data construction improves the performance of the neural decoder in the case of k = 1. We trained an MLP model with the uniform data construction, and compare it with other data constructions of the neural decoder. We also make a comparison with known decoders such as the MD decoder and the MWPM decoder. We choose the [[d², 1, d]] surface code for the comparison, since most of the existing methods were benchmarked with this code. We calculated the performance for two types of error models, the bit-flip noise and the depolarizing noise. The probability distribution of the bit-flip noise is described as follows:
$p_e = \begin{cases} p^{w(e)} (1-p)^{n - w(e)} & \text{if } e_i = 0\ \ \forall i > n, \\ 0 & \text{otherwise}, \end{cases}$  (88)
where p is an error probability per physical qubit, and w(e) is the weight of the physical error e defined in Eq. (4). The probability distribution of the depolarizing noise is described as follows:
$p_e = (p/3)^{w(e)} (1-p)^{n - w(e)}$.  (89)
Note that the occurrences of bit-flip and phase-flip errors are correlated in the depolarizing noise. We first calculated the performance when the physical error probability is around the error threshold, namely, p = 0.1 for the bit-flip noise and p = 0.
15 for the depolarizingnoise. The tunable hyper-parameters of the neural net-work, such as number of layers in network, number ofneurons in each layer, and learning rate, are optimizedwith a grid search for each noise model and for each sizeof the training data set. See Appendix B for the detailsof the parameter optimization and implementation. The performance of the neural decoder under the bit-flip noise is shown in Fig. 5(a). The solid lines are theperformance of the neural decoder with the uniform dataconstruction. The bottom dashed lines represent the log-ical error probability achievable with the MD decoder.The colors red, green, blue, and cyan correspond to dis-tances 5, 7, 9, 11, respectively.Comparing these two types of decoders, we see thatthe logical error probability of the neural decoder is near-optimal with 10 data set at distance 11. On the otherhand, there are gaps between the converged logical errorprobabilities of the neural decoder and that of the MDdecoder when the distance is large. We speculate thatthese gaps are caused by imperfect learning of the spatialinformation of the topological codes, since it is partiallyimproved with the network construction discussed in thenext section.We also implemented the neural decoder with short di-agnosis, i.e. the construction with N = N = N = 1,where N w is a number of logical operators in the rowsof H g corresponding to the class w . This is equivalentto the construction which we showed as an example inSec. III A 2. We call this construction, with the normal-ized sensitivity of O (1), as short diagnosis construction,which is shown as the pale plots in Fig. 5(a). Note thatthe performance of this decoder depends on the choiceof the logical operators. We have tried this constructionwith various choice of the logical operators. The plot-ted data is the best among our trials. Although bothconstructions become near-optimal in the limit of largetraining data size, we see that the performance with theuniform data construction achieves smaller logical errorprobability than that with the short diagnosis construc-tion for any size of the training data set. We have alsoconfirmed that the performance of the neural decoder de-grades when the row vectors of H g consist of the same O ( d ) logical operators of X , Y and Z . In this case,while the number of the rows in H g is the same as thatof the uniform data construction, the sensitivity m ( H g )becomes O ( d ), which makes the normalized sensitivity m ( H g ) M ( H g ) to be O (1). Though these results are not plot-ted, the performance of this construction is almost thesame as the short diagnosis construction. These resultssupport our argument that it is essential for the perfor-mance of the neural decoder to minimize the normalizedsensitivity.The results with the depolarizing noise are shown inFig. 5(b). Note that for the surface code under corre-lated noise such as the depolarizing noise, it is not knownhow an efficient MD decoder can be constructed. We seethat the performance of the neural decoder becomes near-optimal, and is superior to that of the MWPM decoderwith 10 training samples at d = 5 , , , and calculatedthe performance for the distance d = 5 , , ,
The results with the depolarizing noise are shown in Fig. 5(b). Note that for the surface code under correlated noise such as the depolarizing noise, it is not known how an efficient MD decoder can be constructed. We see that the performance of the neural decoder becomes near-optimal, and is superior to that of the MWPM decoder with 10 training samples at d = 5, 7, 9, and 11.

FIG. 5: The performance comparison between the neural decoder with the uniform construction (solid lines) and that with the short diagnosis construction (pale lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the [[d^2, 1, d]] surface code. The logical error probabilities are plotted against the sizes of the training data set with the fixed physical error probability p. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case for the bit-flip noise with p = 0.1. Note that there are no lines for the MWPM decoder, since the MWPM decoder is equivalent to the MD decoder in this setting. (b) The case for the depolarizing noise with p = 0.15.

The performance of the neural decoder for a fixed size of the training data set is shown in Fig. 6. For both of the noise models, the performance is near-optimal when the distance is small. On the other hand, when the distance becomes large, the logical error probability becomes larger than that of the MWPM decoder. The error threshold is usually estimated from the cross point of the performance curves for different distances. We see that the error threshold estimated in this way is worse than that of the MWPM decoder, though the logical error probability is smaller than that of the MWPM decoder.

The actual experiment is expected to be performed with a physical error probability sufficiently smaller than the threshold value. Therefore, we calculated the performance of the decoder at a small physical error probability. The numerical results are shown in Fig. 7. Since the training data set generated with a small value of p is highly imbalanced, we trained the model with p = 0.08 for the bit-flip noise model and with p = 0.11 for the depolarizing noise model. Then, we tested the trained model with data sets generated with p ≤ 0.1. We see that the logical error probability is smaller than that of the MWPM decoder in this region, for all the distances except d = 11.

We also calculated the performance of the neural decoder for two types of color codes. We chose the size of the training data set as 10, and calculated the logical error probability for the distances d = 3, 5, 7, and 9. Note that we cannot construct an efficient MD decoder for the color codes even under independent bit-flip and phase-flip noise. The plots of the logical error probability against the physical error probability p are shown in Fig. 8. The configurations of the plots and lines are the same as those for the surface code. In the case of the bit-flip noise, near-optimal performance is achieved. The performance is also near-optimal in the case of the depolarizing noise at all distances except d = 9. We also see that the performance of the [4,8,8]-color code is better than that of the [6,6,6]-color code. We speculate that this is because the number of physical qubits required in the [4,8,8]-color code is smaller than that of the [6,6,6]-color code at the same distance. These results suggest that the neural decoder with the uniform data construction is effective also for the color codes.

IV. UTILIZING SPATIAL INFORMATION
In this section, we describe the construction of the neural network with convolutional layers. We first discuss how the required size of the data set is expected to be suppressed if the model can utilize the spatial information of the two-dimensional quantum codes. Then, we introduce a construction of the neural network with convolutional layers that utilizes the spatial information of the topological codes. We finally present numerical results showing that the performance of the neural decoder is improved.
A. Importance of the spatial information
In this section, we utilize spatial information of the syndrome by using a convolutional neural network (CNN) as a prediction model. When we use the MLP model, each syndrome value is fed to the network as an independent element of a one-dimensional feature vector, so the model is not informed of the spatial arrangement of the syndrome values in the topological codes.

FIG. 6: The performance comparison between the neural decoder with the uniform construction (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the [[d^2, 1, d]] surface code. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan) with the same 10 training data set. (a) The case of the bit-flip noise. (b) The case of the depolarizing noise.

FIG. 7: The performance comparison between the neural decoder with the uniform construction (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the [[d^2, 1, d]] surface code. The neural decoder is trained with the 10 training data set. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case of the bit-flip noise. The training data set is generated at the physical error probability p = 0.08. (b) The case of the depolarizing noise. The training data set is generated at the physical error probability p = 0.11.

FIG. 8: The performance comparison between the neural decoder with the uniform construction (solid lines) and the MD decoder (dashed lines) in the color codes. We calculated the performance for distances d = 3 (black), 5 (red), 7 (green), and 9 (blue) with the 10 training data set. (a) The case of the bit-flip noise in the [4,8,8]-color code. (b) The case of the depolarizing noise in the [4,8,8]-color code. (c) The case of the bit-flip noise in the [6,6,6]-color code. (d) The case of the depolarizing noise in the [6,6,6]-color code.

In the case of the two-dimensional topological codes, the syndrome values have a natural two-dimensional arrangement. By carefully reshaping the syndrome values into a matrix-shaped arrangement of the feature vector elements, we can explicitly let the model use local correlations of the observed syndromes with the CNN model. In the topological codes, a flip of a single physical qubit invokes at most a constant number (2 in the surface code, 3 in the color code) of local bit-flips in the syndrome value. This implies that whether two (or three) flipped syndrome bits are found in a local region or not is useful for predicting the property of the physical errors. For an intuitive understanding, we elaborate the reason through examples. We consider the surface code under bit-flip errors. Suppose that a syndrome vector s is given in the prediction phase, and the model has encountered slightly different syndrome vectors s_A and s_B, whose differences from s are shown in Fig. 9, in the training phase. The representation is the same as that of Fig. 4. We ignore boundary effects of the topological codes for simplicity. In both cases, the syndrome is at Hamming distance two from the original syndrome vector s, namely, h(s ⊕ s_A) = h(s ⊕ s_B) = 2. On the other hand, there is a difference between the two syndromes in light of whether they help the prediction of the diagnosis for the observed syndrome s. In Eq. (59), we introduced a set of physical errors S(s, w*; h) such that any e ∈ S(s, w*; h) with h ≲ N(H_g) produces a training sample useful for estimating the L2 diagnosis vector for s. For a given s and s′, if there is no vector e_δ such that H_c Λ e_δ^T = s ⊕ s′ and h(e_δ) ≲ N(H_g), we see that no errors e with s(e) = s′ are contained in ∪_{w ∈ {0,1}^{2k}} S(s, w; N(H_g)). In the case of s_A and s_B, there is such a physical error e_δ with a small Hamming weight for s_A, but not for s_B. Thus, if the prediction model can distinguish the samples with s_A from those with s_B, it can recognize that the samples with s_B in the training data set are not relevant to the prediction for s.

FIG. 9: Example of the difference of the syndrome values. Each node in the figure corresponds to a syndrome value, and each edge corresponds to the error status of a physical qubit. The color of the circle indicates whether the syndrome measurement detects an error.

The CNN model can distinguish them since it naturally utilizes the spatial information of the syndrome values.
On the other hand, the MLP model cannot easily distinguish them, since the model is not provided with the relevant spatial structure before training. This discussion implies that, for a fixed size of the training data set, the logical error probability is expected to be improved by the use of a CNN model.

B. Construction of the network
A convolutional neural network extracts patterns from image data through trainable filters that activate (produce a high value) when specific local patterns are present in the input data. The network usually consists of multiple convolutional layers C^(n), each of which consists of differently filtered versions of the image data C^(n)_p, indexed by a channel number p. The (n−1)-th layer with Q channels is filtered to the n-th layer with P channels by Q × P filters, which we represent by f^(n−1,n). We can describe this relation as follows:

    C^{(n)}_{i,j,p} = A( ∑_{d_x} ∑_{d_y} ∑_q f^{(n−1,n)}_{d_x,d_y,q,p} C^{(n−1)}_{i+d_x, j+d_y, q} + b^{(n)}_p ),    (90)

where C^{(n)}_{i,j,p} is the (i, j) element of the p-th channel in the n-th convolutional layer, and f^{(n−1,n)}_{d_x,d_y,q,p} is the (d_x, d_y) element in the (q, p)-th filter from the (n−1)-th layer to the n-th layer. The parameter b^{(n)}_p is the bias added to the p-th channel of the n-th layer. A simple example is shown in Fig. 10, where one layer has three channels and the next layer has two channels.
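As an illustration, the following is a minimal numpy sketch of Eq. (90); it is not the authors' code. The activation A is taken to be ReLU and "valid" boundary handling is assumed for simplicity.

import numpy as np

def conv_layer(C_prev, f, b):
    """One layer of Eq. (90). C_prev: (H, W, Q); f: (Dx, Dy, Q, P); b: (P,)."""
    H, W, Q = C_prev.shape
    Dx, Dy, _, P = f.shape
    out = np.zeros((H - Dx + 1, W - Dy + 1, P))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = C_prev[i:i + Dx, j:j + Dy, :]        # C^(n-1)_{i+dx, j+dy, q}
            for p in range(P):
                out[i, j, p] = np.sum(f[:, :, :, p] * patch) + b[p]
    return np.maximum(out, 0.0)                          # activation A (ReLU here)

# Toy usage: a d x (d-1) syndrome matrix for d = 5 through one 3x3 layer with 10 channels.
d = 5
syndrome = np.random.randint(0, 2, size=(d, d - 1, 1)).astype(float)
print(conv_layer(syndrome, np.random.randn(3, 3, 1, 10), np.zeros(10)).shape)  # (3, 2, 10)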
To use a CNN in our decoding task, we have to express the syndrome vector s in an appropriate matrix representation. We reallocate the syndrome vector for the [[2d^2 − 2d + 1, 1, d]] and [[d^2, 1, d]] codes as shown in Fig. 11. For the [[2d^2 − 2d + 1, 1, d]] surface code, s is converted into two d × (d − 1) matrices for the X syndrome and the Z syndrome. Similarly, for the [[d^2, 1, d]] surface code, s is converted into two (d − 1) × (d + 1)/2 matrices. The overall architecture of the network is shown in Fig. 12; the channel number is chosen to be 10d for the first two convolutional layers and 5d for the last layer. Details about the model architecture are described in Appendix B. It is worth noting that we used the same filters for decoding both X- and Z-flip errors, and max-pooling is not used, as it was observed to reduce the performance of the decoder.

C. Numerical result
We call a neural decoder with the MLP model an MLP decoder, and one with the CNN model a CNN decoder. We compare the performance of the CNN decoder with those of the MLP decoder, the MD decoder, and the MWPM decoder. Note that the training data set is generated with the uniform data construction.

First, we compare the performance of the CNN decoder and that of the MLP decoder in the case of the surface codes. The numerical results are shown in Fig. 13. In this figure, the solid lines and dashed lines are the logical error probabilities for the CNN decoder and the MLP decoder, respectively. The colors red, green, blue, and cyan correspond to distances d = 5, 7, 9, and 11, respectively. For both types of the surface codes, the CNN decoder shows superior performance to that of the MLP decoder at large distances. In particular, in the case of the [[2d^2 − 2d + 1, 1, d]] surface code, the CNN decoder shows a significant improvement of the logical error probability. We see that the CNN model is effective for improving the performance of the neural decoder at large distances. On the other hand, we see that the CNN decoder shows inferior performance to the MLP decoder at a small distance. We speculate the reason for this as follows. The CNN model assumes that the local features can be extracted by using the same filter everywhere. Such an assumption is not necessarily true when the distance is small, since almost all the filtered local regions, of size 3 × 3, are affected by the boundaries of the code. We also tried other choices, but the performance at the small distance did not improve.

FIG. 10: A simple case of a convolutional layer where the input channel number is three and the output channel number is two.

Next, we compared the performance of the CNN decoder with those of the MD decoder and the MWPM decoder. The results are shown in Fig. 14. The solid lines, the dashed lines, and the dotted lines are the logical error probabilities for the CNN decoder, the MD decoder, and the MWPM decoder, respectively. The colors red, green, blue, and cyan correspond to distances d = 5, 7, 9, and 11, respectively. In the case of the bit-flip noise, we see that the logical error probability of the CNN decoder is equal to or slightly better than that of the MD decoder. In the case of the depolarizing noise, though there are gaps between the performances of the CNN decoder and the MD decoder, the performance of the CNN decoder is superior or comparable to that of the MWPM decoder even at the distance d = 11.

We also calculated the logical error probability of the CNN decoder at a small physical error probability p in the case of the [[2d^2 − 2d + 1, 1, d]] surface code. We trained the CNN decoder at p = 0.08 for the bit-flip noise model, and at p = 0.11 for the depolarizing noise model. Then, the decoder is tested with data sets generated with small physical error probabilities. The plots are shown in Fig. 15. In the case of the bit-flip noise, the CNN decoder achieves performance close to the MD decoder also at small physical error probabilities. In the case of the depolarizing noise, the performance of the CNN decoder is superior to that of the MWPM decoder at d = 9, and comparable at d = 11. We can say that the CNN model is effective also for the use of neural decoders at small physical error probabilities.

V. CONCLUSION
In this paper, we theoretically analyzed the mechanism of machine-learning-based decoders for QEC, and proposed a general direction for constructing the data set and the neural network. Then, we numerically showed that our direction is effective compared with the existing works.

Since the formalism of machine learning is flexible, there are many possible ways to reduce the decoding problem in QEC to a task of machine learning. In order to clarify what is the best way of reduction, we introduced the linear prediction framework. This framework essentially includes the existing methods as specific cases, and enables us to discuss conditions for satisfying natural requirements for a good decoder for QEC. In particular, we have derived the condition to perform the optimal decoding in the limit of a large training data size. We also introduced a measure, the normalized sensitivity, which represents a properly-scaled bound on the deviation in the prediction target resulting from a small change in the physical error pattern. We proposed to use this measure as a criterion for constructing a better decoder. We then proposed a general direction for constructing the data set, the uniform data construction, which is applicable to general topological codes. We numerically confirmed that the performance of the neural decoder is improved with the uniform data construction. Our decoder was found to be superior to known efficient decoders, such as the neural decoders proposed in the existing methods and the decoder based on the reduction to minimum-weight perfect matching. We also confirmed that the performance of our neural decoder is near-optimal in various situations by comparing it with the minimum-distance decoder, which is known to be near-optimal but not efficient in general. We also confirmed that the neural decoder can achieve near-optimal performance not only for surface codes but also for color codes.

Another important factor of the neural decoder is the construction of the neural network. We discussed the importance of the spatial information of the syndrome measurement in order to let the prediction model recognize useful samples in a given training data set. To utilize the spatial information, we proposed a neural decoder with a convolutional neural network. We numerically observed that the performance of the neural decoder is further improved with this network construction in the surface code. In particular, we showed that the proposed neural decoder achieves a smaller logical error probability than that of the decoder based on minimum-weight perfect matching even at distance d = 11 with a training data set size of 10.

FIG. 11: How the syndrome vectors are split and reallocated to the two input layers of the neural network. In the case of the [[2d^2 − 2d + 1, 1, d]]-code, the lattice is split into a (d − 1) × d array of syndromes, which is rotated by π/4 to form a d × (d − 1) matrix as the first layer of the neural network. In the case of the [[d^2, 1, d]]-code, we split the syndromes into two (d − 1) × (d + 1)/2 arrays.

FIG. 12: CNN decoder architecture used in our work. We separately pass the X and Z syndrome values through the same convolutional layers, and concatenate them before feeding them to the following fully-connected hidden layer.

Since using machine learning for QEC is an emergent field, there are still many possible extensions and directions for the neural decoders. As we detailed in Appendix B, the prediction time of the neural decoders is smaller than that of the MD decoder, but larger than that of the MWPM decoder on our desktop PC. Since the prediction of the neural decoders can be done with simple matrix multiplications, the time for prediction can be further shortened by using optimized hardware such as a field-programmable gate array (FPGA), which is popularly used in experiments. While we have discussed only a label linearly generated in GF(2), the performance may be further improved by allowing labels nonlinearly generated from the physical error. For example, the relation between the syndrome values and the weight of the physical error, which cannot be generated linearly in GF(2), can be trained and predicted independently with a neural network. Then, the recovery map can be predicted from the syndrome values and the predicted weight with another neural network. The linear prediction framework also limits the samples in the training data set to those sampled from the assumed physical error distribution. However, the distribution which is the best for training is not necessarily the same as the actual distribution. For example, we saw that the prediction model trained at a physical error probability around the threshold value shows high performance also at low physical error probabilities. There can be a more artificial way to construct the training data set that achieves the same performance with a smaller size of the training data set. In the numerical investigation, we observed that the required amount of the data set becomes exponentially large in terms of the distance. This may be suppressed by renormalizing the matrix representation of the syndrome with trained filters, as done in the renormalization group decoder [19]. We expect that the CNN is also applicable to the color codes by using non-rectangular filters. When the stabilizer measurements themselves suffer from noise, stabilizer measurements are often performed repetitively during QEC. In such a case, the length of the syndrome data is not fixed. In our construction, we need to train the neural network again whenever the length of the syndrome data changes.

FIG. 13: The performance comparison between the CNN decoder (solid lines) and the MLP decoder (dashed lines) in the case of the surface codes. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The bit-flip noise in the [[d^2, 1, d]] code. (b) The depolarizing noise in the [[d^2, 1, d]] code. (c) The bit-flip noise in the [[2d^2 − 2d + 1, 1, d]] code. (d) The depolarizing noise in the [[2d^2 − 2d + 1, 1, d]] code.
The studies of Refs. [26, 28] focused on removing this drawback by utilizing recurrent and convolutional neural networks. Using the techniques proposed in Refs. [26, 28], our neural decoder may also be applicable to the case where repetitive stabilizer measurements are performed.

ACKNOWLEDGEMENTS
This work is supported by KAKENHI Grant No. 16H02211; PRESTO, JST, Grant No. JPMJPR1668; CREST, JST, Grants No. JPMJCR1671 and No. JPMJCR1673; ERATO, JST, Grant No. JPMJER1601; and the Photon Frontier Network Program, MEXT. Y.S. is supported by the Advanced Leading Graduate Course for Photon Science. A.D. and Y.S. contributed equally to this work. Y.S. contributed to the construction of the data set in Sec. III, and A.D. to the construction of the network in Sec. IV. K.F. and M.K. motivated and supervised the idea and discussion of this paper.

FIG. 14: The performance comparison between the CNN decoder (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the surface codes. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The bit-flip noise in the [[d^2, 1, d]] code. (b) The depolarizing noise in the [[d^2, 1, d]] code. (c) The bit-flip noise in the [[2d^2 − 2d + 1, 1, d]] code. (d) The depolarizing noise in the [[2d^2 − 2d + 1, 1, d]] code.

APPENDIX A: PROOF OF THE LEMMAS

Proof of the converse part in Lemma III.1
Here we prove the last statement of Lemma III.1. When Eq. (19) does not hold, either (i) there exists e_1 such that

    e_1 ∉ L,    (91)
    H_{cg} Λ e_1^T = 0,    (92)

or (ii) there exists e_1 such that

    e_1 ∈ L,    (93)
    H_{cg} Λ e_1^T ≠ 0.    (94)

For (i), consider two probability distributions {p_e} and {p′_e} such that

    Pr_{e∼{p_e}}[e = 0 | s(e) = 0] = 0.75,    (95)
    Pr_{e∼{p_e}}[e = e_1 | s(e) = 0] = 0.25,    (96)

and

    Pr_{e∼{p′_e}}[e = 0 | s(e) = 0] = 0.25,    (97)
    Pr_{e∼{p′_e}}[e = e_1 | s(e) = 0] = 0.75.    (98)

An optimal decoder for each case succeeds with probability 0.75 given s = 0. On the other hand, since g^{(δ)}(0) = 0 in both cases, only the value of r*(0, 0) is relevant. Since w(0) ≠ w(e_1), any choice of r*(0, 0) leads to a success probability no greater than 0.25 for at least one of the cases.

FIG. 15: The performance comparison between the CNN decoder (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the [[2d^2 − 2d + 1, 1, d]] surface code, where the decoders are trained with the training data set generated at the fixed error rate. We calculated the performance for distances d = 5 (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case of the bit-flip noise. The training data set is generated at the physical error probability p = 0.08. (b) The case of the depolarizing noise. The training data set is generated at the physical error probability p = 0.11.

For (ii), choose w_1 ≠ 0, and if H_g Λ (w_1 G)^T ≠ 0, define

    e_2 := w_1 G;    (99)

otherwise, define

    e_2 := e_1 ⊕ w_1 G.    (100)

This ensures that s(e_2) = 0 and g_2 := H_g Λ e_2^T ≠ 0. Consider two probability distributions {p_e} and {p′_e} such that

    Pr_{e∼{p_e}}[e = 0 | s(e) = 0] = …,    (101)
    Pr_{e∼{p_e}}[e = e_1 | s(e) = 0] = …,    (102)
    Pr_{e∼{p_e}}[e = e_2 | s(e) = 0] = …,    (103)

and

    Pr_{e∼{p′_e}}[e = 0 | s(e) = 0] = …,    (104)
    Pr_{e∼{p′_e}}[e = e_1 | s(e) = 0] = …,    (105)
    Pr_{e∼{p′_e}}[e = e_2 | s(e) = 0] = …,    (106)

An optimal decoder for each case succeeds with probability … given s = 0. On the other hand, since g^{(δ)}(0) = g_2 in both cases, only the value of r*(g_2, 0) is relevant. Since w(0) ≠ w(e_2), any choice of r*(g_2, 0) leads to a success probability no greater than … for at least one of the cases.

Proof of the converse part in Lemma III.2
When the diagnosis matrix is not decomposable, there exists a non-empty subset
W ⊂ {0, 1}^{2k} such that

    ∑_{w∈W} α_w g(w) = ∑_{w∈{0,1}^{2k}\W} β_w g(w),    (107)

where α_w, β_w ≥ 0 and

    ∑_{w∈W} α_w = ∑_{w∈{0,1}^{2k}\W} β_w > 0.    (108)

Consider two probability distributions {p_e} and {p′_e} such that

    Pr_{e∼{p_e}}[w(e) = w, l(e) = l | s(e) = 0] = { α_w/Γ if w ∈ W and l = 0; 0 otherwise },    (109)
    Pr_{e∼{p′_e}}[w(e) = w, l(e) = l | s(e) = 0] = { β_w/Γ if w ∉ W and l = 0; 0 otherwise }.    (110)

From Eq. (107), the L2 diagnosis vector g^{(L2)}(0) is identical for the two distributions. On the other hand, the most probable class w is different for the two probability distributions. This means that a single decoder cannot perform the optimal decoding for both of the two distributions.

Proof of the existence of faithful and decomposable diagnosis matrices
In the main text, we showed that a diagnosis matrix should be faithful and decomposable for performing the optimal decoding in the ideal limit of the training process, and showed an example for the case k = 1. On the other hand, it is not trivial that there exists a faithful and decomposable construction of a diagnosis matrix for an arbitrary stabilizer code. We show that the diagnosis matrix H_g = WG, where W is a 2^{2k} × 2k binary matrix whose i-th row is the 2k-bit binary representation of the integer i, is always faithful and decomposable for an arbitrary stabilizer code and for an arbitrary number of logical qubits k. Since the row vectors of H_g contain all the logical operators, it is trivial that span({(H_{cg})_i}) is equivalent to the logical space L, and H_g is faithful. The condition for decomposability is equivalent to the condition that {g(w) | w ∈ {0,1}^{2k}}, where g(w) := H_g Λ (wG)^T, is affinely independent in the real vector space. To show the latter, we first prove that for any pair of binary vectors w, w′ ∈ {0,1}^{2k} such that w ≠ w′, the weight of g(w) ⊕ g(w′) is 2^{2k−1}. The 2^{2k}-bit sequence g(w) ⊕ g(w′) is given by

    g(w) ⊕ g(w′) = W G Λ G^T (w ⊕ w′)^T.    (111)

Since the matrix G Λ G^T is invertible and since w ⊕ w′ ≠ 0, we have G Λ G^T (w ⊕ w′)^T ≠ 0. Since the matrix W contains all the possible 2k-bit sequences as its rows, half of the elements in the sequence g(w) ⊕ g(w′) are 1, and the others are 0. Thus, the weight of g(w) ⊕ g(w′) is 2^{2k−1}.
Let v := (1, ..., 1)^T be a real vector of order 2^{2k}. We define a set of vectors h(w) := 2g(w) − v for w ∈ {0,1}^{2k}, where this calculation is done in the real vector space. Note that this map is equivalent to replacing 0 and 1 with −1 and 1, respectively. Since this map from g(w) to h(w) is affine, {g(w)} is affinely independent if {h(w)} is linearly independent. The inner product h(w) h(w′)^T for w ≠ w′ can be calculated as

    h(w) h(w′)^T = ∑_i h(w)_i h(w′)_i = 2^{2k} − 2 w(g(w) ⊕ g(w′)) = 0.    (112)

We used the fact that h(w)_i h(w′)_i is 1 if g(w)_i = g(w′)_i, and −1 otherwise. Since nonzero mutually orthogonal vectors are linearly independent, {h(w)} is linearly independent in the real vector space, and the set of vectors {g(w)} is affinely independent. This means that H_g = WG is faithful and decomposable for an arbitrary stabilizer code.
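The combinatorial facts used above can be checked numerically for small k. The following sketch (our illustration, not part of the original proof) verifies, for k = 1, that W u mod 2 has weight 2^{2k−1} for every nonzero u, and that the resulting vectors h(w) are mutually orthogonal; the matrix standing in for G Λ G^T is an assumed typical example.

import numpy as np
from itertools import product

k = 1
W = np.array(list(product([0, 1], repeat=2 * k)))   # all 2k-bit strings as rows

# Fact 1: for every nonzero 2k-bit vector u, W u mod 2 has weight 2^(2k-1).
for u in W[1:]:
    assert (W @ u % 2).sum() == 2 ** (2 * k - 1)

# Fact 2: with an invertible G Λ G^T (here a typical choice for k = 1 is assumed),
# the vectors h(w) = 2 g(w) - 1 with g(w) = W (G Λ G^T) w^T mod 2 are orthogonal.
M = np.array([[0, 1], [1, 0]])                      # assumed stand-in for G Λ G^T
h = {tuple(w): 2 * (W @ (M @ w) % 2) - 1 for w in W}
for w1 in h:
    for w2 in h:
        if w1 != w2:
            assert h[w1] @ h[w2] == 0
print("orthogonality of {h(w)} verified for k = 1")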
APPENDIX B: ADDITIONAL INFORMATION FOR THE IMPLEMENTATION OF THE DECODERS
We describe the details of the implementation of our models, the training process, and the decoders for reference.
Distance   Filter size             Channel number   Neuron number
5          [2x2], [3x3], [3x3]     50, 50, 25       1000
7          [2x2], [3x3], [4x4]     70, 70, 35       3000
9          [3x3], [4x4], [5x5]     90, 90, 45       5000
11         [4x4], [5x5], [6x6]     110, 110, 55     7000
TABLE I: Network architecture of the [[2d^2 − 2d + 1, 1, d]] surface code.

We chose rectified linear units (ReLU(x) = max(0, x)) and a sigmoid function (S(x) = 1/(1 + e^{−x})) as the activation function for the hidden layers and for the final output layer, respectively. Batch normalization was deployed in all of our models and was found to be effective. We also used L2 regularization to avoid over-fitting of the model. In the training phase, the Adam optimization method [33] was used. The learning rate was exponentially decreased, and its schedule was optimized by hand. The network was built with the tensorflow v1.2 platform.

Details about the multilayer perceptron
We optimized the following parameters of the multilayer perceptron using a grid search: the number of neurons per layer, the regularization coefficient β, the batch size, and the number of layers, each over a small set of candidate values (the range for the number of neurons scales with the distance d). Note that in the case of d = 11, we restricted this range due to the memory limit of the GPU. We started the training with a learning rate of 10^{−…}, and it was decreased to 10^{−…} according to a schedule which was optimized by hand. We optimized these parameters for each construction of the diagnosis matrix, distance, physical error probability, error model, and size of the training data set. We chose the configuration which achieves the smallest logical error probability on an independently generated validation data set of size 10. Then, the logical error probability is calculated using another test data set of size 10.

Details about the Convolutional Neural Network
Our CNN model consists of three convolutional layers on top of a single fully-connected hidden layer. For each convolutional layer, the channel number was chosen to be 10d for the first two layers and 5d for the last layer. We chose the batch size as 100 in the training of the CNN model. The network architecture was the same for both the bit-flip and the depolarizing noise models in the [[2d^2 − 2d + 1, 1, d]] surface code, and is described in TABLE I. As for the [[d^2, 1, d]] code with the bit-flip and depolarizing noise models, we used the network architecture described in TABLE II. The filter stride was set to 1 in all directions.

Distance   Filter size             Channel number   Neuron number
5          [2x2], [3x3], [3x3]     50, 50, 25       1000
7          [2x2], [3x3], [3x4]     70, 70, 35       3000
9          [2x3], [3x4], [4x5]     90, 90, 45       5000
11         [2x4], [3x5], [4x6]     110, 110, 55     7000

TABLE II: Network architecture of the [[d^2, 1, d]] surface code.
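For reference, the following is a minimal sketch of a network with the TABLE I architecture for d = 5, written with the present tf.keras API rather than the tensorflow v1.2 code used in this work. The "same" padding, the sigmoid output of length 3d (the number of rows of H_g in the uniform construction), and the loss function are assumptions not fixed by the table; batch normalization and L2 regularization are omitted for brevity.

import tensorflow as tf
from tensorflow.keras import layers, Model

d = 5                      # code distance
in_shape = (d, d - 1, 1)   # each syndrome type reshaped into a d x (d-1) matrix
n_out = 3 * d              # assumed length of the diagnosis vector (uniform construction)

# Shared convolutional layers: the same filters process the X and Z syndromes.
convs = [
    layers.Conv2D(50, (2, 2), padding="same", activation="relu"),
    layers.Conv2D(50, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(25, (3, 3), padding="same", activation="relu"),
]

def branch(x):
    for c in convs:        # applying the same layer objects gives shared weights
        x = c(x)
    return layers.Flatten()(x)

x_in = layers.Input(shape=in_shape, name="x_syndrome")
z_in = layers.Input(shape=in_shape, name="z_syndrome")
h = layers.Concatenate()([branch(x_in), branch(z_in)])
h = layers.Dense(1000, activation="relu")(h)          # fully-connected hidden layer
out = layers.Dense(n_out, activation="sigmoid")(h)    # predicted diagnosis vector

model = Model([x_in, z_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()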
Implementation of the minimum-distance decoder
The minimum-distance decoder of the surface code under the bit-flip noise can be implemented by reducing the problem to minimum-weight perfect matching. The minimum-weight perfect matching can be efficiently solved with the blossom algorithm [31]. We used Kolmogorov's implementation of the blossom algorithm [34]. In the other cases, we reduced the problem to the following instance of integer programming:

    minimize w(e)  subject to  H_c Λ e^T = s.    (113)

This problem was solved with IBM ILOG CPLEX. We obtained at least 10 samples for each plot. In all the cases, the solver reached the optimal solution.
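The integer program in Eq. (113) can also be written with an open-source solver. The following sketch is not the CPLEX implementation used in this work; it uses PuLP with the bundled CBC solver, and assumes the binary symplectic convention e = (e_X | e_Z) with Λ = [[0, I], [I, 0]], counting the Pauli weight with one auxiliary binary variable per qubit.

import numpy as np
import pulp

def md_decode(H_c, Lam, s):
    """Solve Eq. (113): minimize the Pauli weight of e subject to H_c Λ e^T = s over GF(2)."""
    A = (H_c @ Lam) % 2                    # binary constraint matrix
    m, n2 = A.shape
    nq = n2 // 2                           # assumed ordering e = (e_X | e_Z)
    prob = pulp.LpProblem("minimum_distance_decoding", pulp.LpMinimize)
    e = [pulp.LpVariable(f"e_{j}", cat="Binary") for j in range(n2)]
    y = [pulp.LpVariable(f"y_{q}", cat="Binary") for q in range(nq)]              # qubit q acted on?
    k = [pulp.LpVariable(f"k_{i}", lowBound=0, cat="Integer") for i in range(m)]  # GF(2) slack
    prob += pulp.lpSum(y)                  # objective: Pauli weight w(e)
    for q in range(nq):                    # y_q >= e_X,q and y_q >= e_Z,q, so y_q = OR at the optimum
        prob += y[q] >= e[q]
        prob += y[q] >= e[nq + q]
    for i in range(m):                     # H_c Λ e^T = s over GF(2), written over the integers
        prob += (pulp.lpSum(int(A[i, j]) * e[j] for j in range(n2)) - 2 * k[i] == int(s[i]))
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return np.array([int(pulp.value(v)) for v in e], dtype=int)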
Time for single prediction, implementation and environment

We measured the time for a single decoding of the [[2d^2 − 2d + 1, 1, d]] surface code with d = 11 and p = 0.
15 under the depolarizing noise for the MD decoder, the MWPM decoder, and the proposed neural decoders with the MLP and CNN models. Note that the times of the MD decoder and the MWPM decoder depend on the physical error probability.

We used IBM ILOG CPLEX via a python wrapper for constructing the MD decoder. The program was executed on an Intel Xeon E5-2687W v4 with default settings. The MD decoder takes about 330 milliseconds per decoding. Note that the time may be improved by optimizing the settings of CPLEX.

Kolmogorov's implementation of the blossom algorithm [31, 34] was used for the MWPM decoder. We compiled the code with Microsoft Visual C++ 2015 with the O2 option. The program was executed on an Intel Core i7-6700 without parallelization. The MWPM decoder took about 56 microseconds per decoding.

The proposed neural decoders were implemented with python and tensorflow. We measured the time for a single prediction when we set the batch size as 1, the number of layers as 2, and the number of units per layer as 7000 for the MLP model. The configuration of the CNN model is shown in TABLE I. The computation was performed using an Intel Core i7-6700 and a GeForce GTX 1060 6GB. The proposed neural decoders with the MLP and CNN models took 2.2 milliseconds and 7 milliseconds, respectively, for feed-forwarding the input data and finding the most probable class w. Since the prediction of the neural decoders can be done with simple matrix multiplications, we expect that the time for a single prediction of the neural decoder can be shortened by using optimized hardware such as an FPGA.

APPENDIX C: THE SPECIFIC CHOICES OF THE UNIFORM DATA CONSTRUCTION
We have introduced the uniform data construction in Sec. III. In this appendix, we show specific uniform data constructions for the surface and color codes.

We choose 3d logical operators for the [[2d^2 − 2d + 1, 1, d]] surface code by using the two patterns shown in Fig. 16. For pattern 1, each dotted line corresponds to a logical X operator, which is the product of the Pauli Z operators on the vertices on the line. For pattern 2, each dotted line corresponds to a logical Z operator, which is the product of the Pauli X operators on the vertices on the line. We choose d logical Y operators written as the product of the i-th logical X operator and the i-th logical Z operator for i = 0, ..., d − 1. We choose 3d logical operators for the [[d^2, 1, d]] surface code with the two patterns shown in Fig. 17. The rule of the choice is the same as that of the [[2d^2 − 2d + 1, 1, d]] surface code.

We choose 9(d + 1)/2 logical operators for the [6,6,6]-color code as shown in Fig. 18. There are (d + 1)/2 lines for each pattern. In all of the three patterns, each line corresponds to the logical X-, Z-, and Y-operators acting on the physical qubits on the line. We choose 6(d + 1) logical operators for the [4,8,8]-color code as shown in Fig. 19. There are (d + 1)/2 lines for each pattern. The choice of the logical operators is the same as that of the [6,6,6]-color code.

In all the patterned choices of the logical operators, we can verify that the sensitivity is constant, since every physical qubit is measured by at most a constant number of logical operators. On the other hand, the minimum boundary distance scales as O(d), since the same number O(d) of logical X-, Y-, and Z-operators are used. Thus, the normalized sensitivity scales as O(d^{−1}) with these choices.

FIG. 16: Logical operators used for the construction of a diagnosis matrix for the [[2d^2 − 2d + 1, 1, d]] surface code. Each dotted black line corresponds to a chosen logical operator. (a) Pattern 1. (b) Pattern 2.

FIG. 17: Logical operators used for the construction of a diagnosis matrix for the [[d^2, 1, d]] surface code. Each dotted black line corresponds to a chosen logical operator. (a) Pattern 1. (b) Pattern 2.

FIG. 18: Logical operators used for the construction of a diagnosis matrix for the [6,6,6]-color codes. Each colored line corresponds to chosen logical operators. The lines are colored only for visibility, and are not related to the colors of the color codes. (a) Pattern 1. (b) Pattern 2. (c) Pattern 3.
FIG. 19: Logical operators used for the construction of a diagnosis matrix for the [4,8,8]-color codes. Each colored line corresponds to chosen logical operators. The lines are colored only for visibility, and are not related to the colors of the color codes. (a) Pattern 1. (b) Pattern 2. (c) Pattern 3. (d) Pattern 4.

[1] A. Y. Kitaev, Russian Mathematical Surveys, 1191 (1997).
[2] D. Aharonov and M. Ben-Or, in Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (ACM, 1997) pp. 176-188.
[3] E. Knill, R. Laflamme, and W. H. Zurek, in Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 454 (The Royal Society, 1998) pp. 365-384.
[4] J. Kelly, R. Barends, A. Fowler, A. Megrant, E. Jeffrey, T. White, D. Sank, J. Mutus, B. Campbell, Y. Chen, et al., Nature, 66 (2015).
[5] A. Córcoles, E. Magesan, S. J. Srinivasan, A. W. Cross, M. Steffen, J. M. Gambetta, and J. M. Chow, Nature Communications (2015).
[6] D. Ristè, S. Poletto, M.-Z. Huang, A. Bruno, V. Vesterinen, O.-P. Saira, and L. DiCarlo, Nature Communications (2015).
[7] A. Y. Kitaev, Annals of Physics, 2 (2003).
[8] E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, Journal of Mathematical Physics, 4452 (2002).
[9] D. A. Lidar and T. A. Brun, Quantum error correction (Cambridge University Press, 2013).
[10] S. B. Bravyi and A. Y. Kitaev, arXiv preprint quant-ph/9811052 (1998).
[11] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, Physical Review A, 032324 (2012).
[12] C. Wang, J. Harrington, and J. Preskill, Annals of Physics, 31 (2003).
[13] D. S. Wang, A. G. Fowler, and L. C. L. Hollenberg, Physical Review A, 020302 (2011).
[14] A. G. Fowler, A. C. Whiteside, and L. C. L. Hollenberg, Phys. Rev. Lett., 180501 (2012).
[15] A. M. Stephens, Physical Review A, 022321 (2014).
[16] M.-H. Hsieh and F. Le Gall, Physical Review A, 052331 (2011).
[17] H. Bombin and M. A. Martin-Delgado, Journal of Mathematical Physics, 052105 (2007).
[18] N. Delfosse, Physical Review A, 012317 (2014).
[19] G. Duclos-Cianci and D. Poulin, Physical Review Letters, 050504 (2010).
[20] E. Magesan, J. M. Gambetta, A. Córcoles, and J. M. Chow, Physical Review Letters, 200501 (2015).
[21] G. Carleo and M. Troyer, Science, 602 (2017).
[22] J. Carrasquilla and R. G. Melko, Nature Physics, 431 (2017).
[23] J. Romero, J. P. Olson, and A. Aspuru-Guzik, Quantum Science and Technology, 045001 (2017).
[24] G. Torlai and R. G. Melko, Physical Review Letters, 030501 (2017).
[25] S. Varsamopoulos, B. Criger, and K. Bertels, Quantum Science and Technology, 015004 (2017).
[26] P. Baireuther, T. E. O'Brien, B. Tarasinski, and C. W. Beenakker, Quantum, 48 (2018).
[27] S. Krastanov and L. Jiang, Scientific Reports, 11003 (2017).
[28] N. P. Breuckmann and X. Ni, Quantum, 68 (2018).
[29] D. Gottesman, arXiv preprint quant-ph/9705052 (1997).
[30] G. Duclos-Cianci and D. Poulin, in Information Theory Workshop (ITW), 2010 IEEE (IEEE, 2010) pp. 1-5.
[31] J. Edmonds, Canadian Journal of Mathematics, 449 (1965).
[32] K. Hornik, M. Stinchcombe, and H. White, Neural Networks, 359 (1989).
[33] D. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014).
[34] V. Kolmogorov, Mathematical Programming Computation.