Neural network setups for a precise detection of the many-body localization transition: finite-size scaling and limitations
PHYSICAL REVIEW B 100, 224202 (2019)
Hugo Théveniaut and Fabien Alet
Laboratoire de Physique Théorique, IRSAMC, Université de Toulouse, CNRS, UPS, 31062 Toulouse, France

(Received 9 May 2019; revised manuscript received 10 October 2019; published 2 December 2019)

Determining phase diagrams and phase transitions semiautomatically using machine learning has received a lot of attention recently, with results in good agreement with more conventional approaches in most cases. When it comes to more quantitative predictions, such as the identification of universality class or precise determination of critical points, the task is more challenging. As an exacting testbed, we study the Heisenberg spin-1/2 chain in a random external field, with input data from system sizes up to L = 24. In particular, we use domain adversarial techniques to ensure that the network learns scale-invariant features. We find a variability of the output results with respect to network and training parameters, resulting in relatively large uncertainties on the final estimates of the critical point and correlation length exponent, which tend to be larger than the values obtained from conventional approaches. We put the emphasis on interpretability throughout the paper and discuss what the network appears to learn for the various architectures used. Our findings show that a quantitative analysis of phase transitions of unknown nature remains a difficult task with neural networks when using minimally engineered physical input.

DOI: 10.1103/PhysRevB.100.224202
I. INTRODUCTION
The recent application of machine learning techniques to condensed matter and statistical physics has led to several important successes in various problems, ranging from the detection of phases of matter from synthetic [1-9] or experimental data [10,11], wave-function reconstruction [12], and the improvement of variational Ansätze for quantum problems [13-18], to efficient Monte Carlo sampling [19-21], in such a way that machine learning is now regarded as a new tool for the study of complex, interacting, (quantum) physical systems [22-24].

In the case of phase identification, the semiautomatic discovery of phase transitions and mapping of phase diagrams rely on the ability of machine learning algorithms to extract the relevant features for the classification of samples from large datasets. The data consist, for instance, of Monte Carlo snapshots of configurations or measurements of various types of observables. This approach has enabled the recovery of known phase diagrams or the location of phase transitions in qualitative agreement with more conventional approaches (based, for instance, on order parameters and/or the theory of finite-size scaling), achieving this at a much lower computational cost, e.g., using fewer samples or smaller system sizes. In some cases, critical exponents have been extracted by an analysis of the neural network (NN) outputs [1,25-28], a particularly nontrivial prediction.

There are cases, however, where machine learning techniques fail to capture the correct physical behavior, or at least not as accurately as the conventional approaches do [3,6,7,29,30]. The selection of the input data is paramount in this method, as providing engineered quantities known to have physical content naturally helps the NN to be more accurate with even more modest resources (input data or network size) [3,31]. It can indeed sometimes require a degree of manual feature engineering to accurately capture the physical behavior in a transition region [3,29,31].

A natural question is whether machine learning can lead to superior results in the case of unknown phase diagrams or for systems where conventional approaches have difficulties. One telling modern example, and the focus of the current work, is the many-body localization (MBL) transition in one-dimensional quantum disordered systems.
There, finite-size effects are crucial to apprehend the transition, as the size of available samples is limited (contrary to classical or quantum problems that can be treated with Monte Carlo simulations). There is furthermore no accepted finite-size scaling theory for this transition. Conventional approaches based on the extensive study of various physical quantities provide an estimate of the phase transition [32], but may be hampered by finite-size effects (for the standard MBL model considered in this work, the maximum sample size that can be used to probe the transition regime is L = 24 [33]). For instance, attempts at performing finite-size scaling [32,34] result in a critical exponent for the correlation/localization length which does not match predictions from renormalization group approaches and does not fulfill a bound argued to be valid for the MBL phase transition.

In this work, we provide a detailed analysis of a neural network, from its construction to the treatment of its output, designed to locate the MBL phase transition in a prototypical 1D quantum model. Our goal is not to engineer the best network architecture that reproduces the known estimate of the phase transition with the smallest amount of input data, but rather to see if we can go beyond by using the same (high) quality of input data. In short, we ask the following question: can a NN approach provide a quantitative description of the transition (not just qualitative), and in particular improve the determination of the critical point and exponents? In doing so, and as a probe of the efficiency of this approach for unknown phase diagrams, we furthermore wish to provide the least engineered input.

There have been several prior works that used NN approaches to locate the MBL phase transition in 1D disordered quantum systems. One group of works [2,35-39] considered inputting the entanglement spectrum of eigenstates, resulting in phase diagrams in (qualitative) agreement with conventional approaches. One notes, however, that the entanglement spectrum is a high-level engineered quantity from which physical features are already extracted (for instance, the two phases around the MBL transition have a different scaling behavior for the entanglement entropy), and that there was no systematic study of finite-size effects on the prediction of the network. Other works [40,41] considered locating the phase transition using information from dynamical measurements (such as time traces of observables after a quench).
While larger systems can be treated with this approach, finite-time effects (contrary to eigenstates, which are probes of infinite-time behavior) may be relevant, especially close to the transition. The question of which observable to input also leaves more room for feature engineering in this case. Nevertheless, this approach may be particularly relevant for experiments which probe the MBL transition [42-45], and which are precisely based on measurements of finite-time traces after a quench. Finally, two recent works [26,27] considered inputting directly the eigenfunctions in order to detect the MBL transition, this time supplemented by a finite-size scaling analysis.

In this work, we follow this last approach of inputting the wave function, as this is likely the best way to provide unbiased information to the network (see discussion below in Sec. III A). We provide a detailed analysis of the influence of the network architecture and hyperparameters on the predictions. Quite crucially, we consider input data obtained from large system sizes up to L = 24 spins, that is, at or beyond the state-of-the-art numerics used with conventional approaches.

The plan of the manuscript is as follows. Section II describes the lattice model used in this study and briefly recapitulates aspects of its MBL transition. In Sec. III, we provide an extensive description of possible NN setups to study the transition, discussing the choice of input data (Sec. III A), network architectures (Sec. III B), and hyperparameters (Sec. III C), as well as output treatment (Sec. III D). The remainder of the paper presents our results using three different setups: a single-size training setup where an ensemble of NNs are separately trained on different system sizes (Sec. IV), and a multisize training setup where one NN is trained on a dataset containing data from multiple system sizes all at once (Sec. V), including a constraint in the form of a domain adversarial component to achieve better generalization (Sec. VI). Section VII critically discusses these results and summarizes the open questions and challenges for the detection of MBL physics using NNs.
II. MODEL AND MBL TRANSITION
Many-body localization (see Refs. [46-50] for introductions and reviews) is an active research area which aims at understanding the possibility of survival of Anderson localization in many-body, strongly interacting quantum systems. Existence of the transition to a MBL phase is now accepted for one-dimensional lattice models in the presence of strong-enough disorder. The hallmarks of MBL include low entanglement in eigenstates (even in the middle of the many-body spectrum), absence of thermalization (the eigenstate thermalization hypothesis [51,52] and the validity of thermodynamic ensembles are not respected), emergence of integrability (through the formation of local integrals of motion), memory of initial conditions in quench setups, etc. [46-50]. All these specificities have been used as probes of the existence of a MBL phase in various studies; here, however, we would like not to impose the use of any specific probe, but to use machine learning to detect the transition to MBL.

We perform computations on the standard lattice model of MBL, namely, the spin-1/2 Heisenberg chain in a random magnetic field:

H = Σ_{i=1}^{L} S_i · S_{i+1} − Σ_{i=1}^{L} h_i S_i^z.  (1)

Here, S_i = (S_i^x, S_i^y, S_i^z) denotes a vector of spin-1/2 operators at site i, and the random fields h_i are drawn uniformly in [−h, h]. We consider eigenstates at the normalized energy target ε = (E − E_min)/(E_max − E_min) = 0.5, where E_min/max are the energy spectrum extrema. For ε = 0.5, the transition has been estimated to occur at h_c ≈ 3.7 [32]. Assuming a power-law divergence of a correlation/localization length ξ ∼ |h − h_c|^{−ν}, the finite-size analysis of Ref. [32] estimated a value of ν of order unity. This is to be contrasted with predictions from renormalization group approaches, which yield much larger values of ν, or even ν = ∞ (the Kosterlitz-Thouless scenario advocated in recent works [58-60]). Also, the Harris-Chayes criterion, which has been argued [61] to hold also for the MBL transition, provides a bound ν ≥ 2.

III. BUILDING A NEURAL NETWORK TO STUDY THE MBL TRANSITION
There are many possible ways to design a neural network aimed at detecting the MBL transition, which can vary in the choice of input data, the architecture and hyperparameters of the network, as well as the interpretation of the network output. In making these choices, we are guided by the following principles: (i) minimal manual feature engineering: we want the input data not to be preprocessed with already-extracted physical features that could bias the predictions; (ii) scalability: the network should be able to treat data from different system sizes (in order to perform finite-size scaling); (iii) low variability: the dependence on irrelevant (unphysical) parameters should be kept as small as possible; and (iv) interpretability: the architecture should allow for possible physical explanations of what the machine actually learned. In the following, the interpretation of the neural network will be achieved by an analysis of its internal weights.
A. Choice of input data
The choice of input data and its formatting is of major importance in the context of detecting phase transitions with supervised learning. Indeed, this method relies on the fact that the NN will be able to learn the relevant phase characteristics while being trained only in two extreme limits of the phase diagram.

In this work, we directly input the eigenstates. One caveat is that this requires specifying a basis set in which to expand. Given an eigenstate |n⟩, we expand it in the S^z computational basis |i⟩: |n⟩ = Σ_i c_i |i⟩, and we denote p_i ≡ |c_i|^2. The choice of this basis stems from the fact that the S^z basis diagonalizes the model (1) in the infinite-disorder limit. Moreover, basis-dependent quantities such as the inverse participation ratio IPR(|n⟩) = Σ_i p_i^2 or the associated participation entropies S_q^P(|n⟩) = [1/(1 − q)] log Σ_i p_i^q have been shown to capture the different behaviors of the two phases [32,33,63]. At the technical level, this is also the basis in which the largest system sizes can be studied (as the Hamiltonian is quite sparse).

However, the exponentially growing number of coefficients with system size will eventually lead to computational issues for the largest systems: e.g., for L = 24, each eigenstate has more than 2.8 million coefficients, which, when multiplied by the number of eigenstates per disorder realization and the number of disorder realizations, amounts to an extremely large amount of input data. This would entail very slow training and necessitate lossy compression implemented in the NN architecture, with many pooling layers for instance. We chose to engineer this compression step by hand: our solution is to keep only the largest coefficients of each eigenstate, i.e., we retain the N_c largest p_i. For illustration, we present in Fig. 1 typical p_i's for two values of disorder representative of the ETH and MBL phases. Note that the basis states associated with the largest coefficients differ from one eigenstate to another. The fact that the input data are now of fixed size (independent of L) will be useful when training with data from multiple sizes at once in Sec. V.

FIG. 1. Examples of the N_c = 256 highest probabilities p_i for eigenstates in the middle of the spectrum (ε = 0.5) for different disorder realizations and system sizes, for two disorder values located strongly in the ETH (left) and MBL (right) phases.

A lot of information is certainly lost in doing so, but we argue that it may not be crucial. On the MBL side, the local-integrals-of-motion picture [64-67] indicates that eigenstates all typically have the same structure coming from the strong-disorder limit, with a very strong coefficient for a particular S^z basis state (different for each eigenstate). On the ETH side, from random matrix theory, one expects a random coefficient structure with no correlation between basis states. As a further argument, one notes that the IPR and participation entropies S_q^P (for q ≥ 1) are dominated by the large p_i (independently of which basis state corresponds to index i in one eigenstate or another).

Data normalization is often recommended in conventional machine learning applications [68], as it helps a lot in accelerating, or even rendering possible, the learning process. In our context, we want to avoid this feature-engineering step as much as possible, since it can bias the data and possibly lead to a false estimate of the transition point. As an example, if we kept the sample size constant from one system size to another not by truncating but by down-sampling [69] the largest probabilities, this would eventually lead to an underestimation of the critical disorder, because down-sampling would lead to a faster decay of the highest probabilities, eventually making the eigenstates look more "MBL" than they are.

Finally, we note that methods based on sampling of the eigenstates (such as quantum Monte Carlo) will pick up the basis states precisely with probability p_i, and therefore this choice of truncation may be useful in other contexts where the exact eigenstates cannot be reached.

We obtain exact eigenstates at ε = 0.5 in the S^z = 0 sector (the total magnetization S^z = Σ_r S_r^z is conserved in this model). We insist on having a large, state-of-the-art dataset. For training, we use 1000 realizations of disorder per disorder strength, and 250 (respectively, about 150) realizations of disorder at prediction time for sizes L = 14, 16, 18, 20, 22 (respectively, L = 24). We keep 100 eigenstates per realization for L ≤ 22 (respectively, fewer for L = 24), retaining for each eigenstate its N_c largest probabilities p_i (Secs. IV-VI). In Appendix C, we furthermore consider the coefficients c_i (i.e., restoring the sign) of the largest N_c amplitudes as inputs for the neural networks.

B. Choice of neural network architecture
As can be seen in Fig. 1, strongly ETH and strongly MBL samples are in fact linearly separable (a threshold value for the largest p_i suffices), hence the use of a neural network for this classification task appears unnecessary at first sight. However, our actual task is not only to perform well in the well-defined labeled regions of the phase diagram but, more importantly, to assign labels to samples in the transition region. We can therefore view our work as a benchmark of the NN's ability to capture the relevant features and finite-size trends of the MBL transition, or, put another way, of whether NNs are a good Ansatz for the classification of the phases present in model (1).

The chosen neural-network architecture is kept simple so that its interpretation remains possible to a reasonable extent, and its optimization is standard. Nevertheless, we provide details for clarity (we also refer the interested reader to Ref. [71] for an introduction to machine learning for physicists). Artificial neural networks, and in particular fully connected feed-forward neural networks, are based on elementary units called artificial neurons. These units are simple functions that take a vector x of real values and transform it according to

y = g(W · x + b),  (2)

where g is a nonlinear so-called activation function, W is a vector of weights, and b is a real weight called the bias. Similarly, one can define a layer of artificial neurons which implements a mapping between an input vector x and an output vector y as follows:

y = g(Ŵ · x + b),  (3)

where Ŵ is now a matrix of weights, b is a vector, and g is applied element-wise. This way, one can successively stack layers of artificial neurons, building more and more complex functions. There is some flexibility in the choice of g, and frequently used activation functions include

ReLU(x) = 0 if x < 0,  x if x ≥ 0,  (4)

ELU(x) = e^x − 1 if x < 0,  x if x ≥ 0,  (5)

Softmax_i(x) = e^{−x_i} / Σ_j e^{−x_j}.  (6)

Since our task is to classify eigenstates as being ETH or MBL, our neural network is a function that takes as input an eigenstate in the form of a vector of size N_c and outputs its label, that is, (0,1) if it is ETH and (1,0) if it is MBL. The network used in the following is shown schematically in Fig. 2: there is one hidden layer of 32 neurons with g = ELU activation functions [see Eq. (5)] and a two-neuron output layer with g = Softmax activation functions [see Eq. (6)]. Due to the softmax activation function, the output vector of the neural network is real and normalized to 1; it can thus be interpreted as the probabilities of belonging to either the ETH or the MBL phase. Training is done through stochastic gradient descent of a cross-entropy cost function with the ADAM optimizer [72].

FIG. 2. Neural-network architecture used in this work.

For our NN architecture and for the input data chosen, we found that the usual choice of ReLU activation functions (4) actually induces a bias in the location of the phase transition through the appearance of dead neurons that limit the NN capacity. This effect is described at length in Appendix A. The problem is avoided using other activation functions like tanh, leaky rectified linear units (leaky ReLU), or exponential linear units (ELU) [73], the latter being used in this work.

When used along with ELU units, we noticed that dropout [74] brings additional practical benefits. This regularization technique consists in randomly dropping connections between neurons (here between the hidden layer and the output layer) during training. This prevents neurons from coadapting and allows the network to learn feature detectors that are more independent of each other.
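As a concrete illustration of Eqs. (2)-(6) and the Fig. 2 architecture, here is a minimal NumPy sketch with random, untrained weights; the function names, the weight scale, and the mock input are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def relu(x):
    """Eq. (4): 0 for x < 0, x otherwise, applied element-wise."""
    return np.where(x < 0.0, 0.0, x)

def elu(x):
    """Eq. (5): exp(x) - 1 for x < 0, x otherwise."""
    return np.where(x < 0.0, np.exp(x) - 1.0, x)

def softmax(x):
    """Eq. (6), with the sign convention of the text; shifted for stability."""
    e = np.exp(-(x - np.min(x)))
    return e / e.sum()

def forward(p, W1, b1, W2, b2):
    """Fig. 2 architecture: Eq. (3) applied twice, i.e. one hidden ELU layer
    (32 neurons) followed by a two-neuron softmax output."""
    h = elu(W1 @ p + b1)
    return softmax(W2 @ h + b2)   # read as (P_ETH, P_MBL)

# Untrained example; the input is a mock eigenstate truncated to N_c = 256.
rng = np.random.default_rng(0)
Nc = 256
W1, b1 = rng.normal(scale=0.1, size=(32, Nc)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(2, 32)), np.zeros(2)
p = np.sort(rng.dirichlet(np.ones(4 * Nc)))[::-1][:Nc]  # sorted probabilities
y = forward(p, W1, b1, W2, b2)
```

The softmax guarantees that the two outputs are nonnegative and sum to 1, which is what allows their interpretation as phase probabilities.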
C. Model selection
In most traditional classification problems, model selection is done with respect to predictions on a labeled test set. For instance, a low test accuracy reveals that the model is unable to generalize well to unseen samples. In our case, all considered neural networks achieved 100% accuracy on the training and test sets, but this only says that our data and chosen architecture are extremely good at distinguishing strongly ETH from strongly MBL samples. Given that our actual task is to assign labels to samples from the transition region, we need to find other ways of discriminating the NN performance.

One possibility is to ensure that the learned model achieves low bias and low variance. On the one hand, we argue that the bias is low, having checked that increasing the number of hidden neurons does not change the predictions but rather increases the variance. On the other hand, the variance is kept small by choosing a relatively small number (32) of hidden neurons. Moreover, we can track the variance using cross-validation, i.e., obtaining multiple training instances from random initialization of the NN weights and random partitioning of the training datasets (as we leave aside a fraction of the data in a separate test set). In most cases, we observe a low and stable (during training) variance with a learning rate empirically chosen at α = 0.01 and a fixed batch size N.

FIG. 3. Color histogram of the output of an exemplary neural network trained with L = 18 data, evaluated on 300 disorder realizations for each disorder strength.
D. Output analysis
As can be seen in Fig. 3, the typical distribution of the NN output for a given system size L and disorder strength h is unimodal, with very low variance around 0 (1) in the ETH (MBL) phase. This motivates the choice of the fraction f of samples whose classification confidence is above 0.5 as a good quantity to faithfully describe the output of the neural network. Note that f is then the proportion of MBL-classified eigenstates from an ensemble of eigenstates coming from different disorder realizations and classified by different training instances. We will clarify later how we compute the error bars on this quantity (see Appendix B for more details on the different sources of classification variance).

To the best of our knowledge, there is no theory which describes the finite-size scaling (with L) of the network output. Indeed, there is in general no expectation for which kind of physical observable (if any) the output will correspond to: for a standard continuous phase transition, f could for instance mimic the order parameter or its Binder cumulant (or any combination thereof), which are known to display different critical behaviors and finite-size effects. Various phenomenological scalings have thus been used in the literature. When output curves for different L cross as a function of the control parameter h, a natural scaling form is f = g[L^{1/ν}(h − h_c)], with h_c the critical disorder strength and ν the exponent associated with the divergence of a correlation/localization length ξ ∼ |h − h_c|^{−ν}. This is the form that was used, e.g., in Ref. [1] for the Ising model, or for the MBL transition in Ref. [27]. When curves for f do not cross, one can alternatively try to define a finite-size pseudocritical point h_c(L) (with some criterion) and naturally assume a finite-size relation h_c(L) − h_c ∼ L^{−1/ν}. This was for instance used in Refs. [25,26]. In our case, we find (see Figs. 4 and 6) that the latter situation applies (no crossing of the curves) and thus assume the second scaling form.

In this case, there is a variety of options for the definition of h_c(L), as can be seen in earlier works: one can introduce a confidence threshold p_c as performed in Ref. [35] and look for the maximum of the confusion curves, alternatively pinpoint the transition where the mean output curves reach 0.5 as in Ref. [26], or consider the maximum of the confusion as defined in Ref. [40]. In the following, we define h_c(L) to be the disorder strength at which half of the samples are classified as MBL, meaning f(h_c(L)) = 0.5.

FIG. 4. Fraction of MBL-classified samples as a function of disorder strength for NNs trained on a given system size L. Predictions are averaged over 250 disorder realizations per disorder strength (with 100 eigenstates per realization) and 50 training instances. Truncation order is N_c = 256; h_c(L) is defined by f(h_c(L)) = 0.5. The error bars on the final estimates come from the fitting procedure.

IV. SINGLE SYSTEM SIZE TRAINING
The most direct way to do a finite-size study of model (1) assisted by neural networks is to train one NN for each system size. Hence, we study the predictions of five neural networks trained on data from L = 14, 16, 18, 20, 22, respectively, with ETH (respectively, MBL) labels assigned to samples at weak disorder, h = 0.25 (respectively, at strong disorder), for all system sizes.

Figure 4 shows the fraction f of MBL-classified samples: if y_{θ,r,i}(h) denotes the probability of eigenstate i from disorder realization r being classified as MBL by the neural network θ, then

f(h) = [1/(N_θ N_r N_i)] Σ_{θ,r,i} Θ(y_{θ,r,i}(h) − 1/2),

where Θ is the Heaviside step function. As eigenvectors of the same disorder realization are correlated and the neural networks have very low variance, we chose to bin quantities over all eigenstates of the same realization and all neural networks, and then compute the standard error over these bin averages (as performed in Ref. [32]), in order not to underestimate the error bars. Appendix B gives further details on the variations of sample classification from one NN instance to another, including a discussion of predictions for individual eigenstates and their correlation with the entanglement entropy.

Several features can be distinguished: one is the existence of a fully ETH region (where all samples are classified as ETH) that extends from h = 0 up to a disorder value that grows with L, i.e., the crossover from ETH to MBL happens at higher disorder as L is increased. This behavior is in agreement with many other observables (such as spectral statistics, entanglement variance, dynamical spin fraction) used in the standard analysis of this system [32], which also display regions where ETH and MBL are clearly well identified, and a crossover region with a right shift (i.e., towards larger disorder) of the finite-size estimate of the transition point with system size.

TABLE I. Finite-size scaling results with single-size training, as a function of truncation order N_c.

Truncation    h_c          ν            χ²/dof
N_c = 64      3.… ± 0.13   0.… ± 0.07   0.03
N_c = 128     3.… ± 0.09   0.… ± 0.06   0.13
N_c = 256     3.… ± 0.09   0.… ± 0.05   0.32

Finite-size scaling.
We define the finite-size pseudocritical point h_c(L) as the disorder for which the fraction f of MBL-classified samples equals 0.5. The finite-size scaling results for the different truncation orders N_c = 64, 128, 256 are summarized in Table I. In practice, we approximate the fraction f by a cubic polynomial around the putative h_c(L), fitted in the interval [h_c(L) − w; h_c(L) + w] with a fixed half-width w. The resulting estimates of h_c lie below the conventional estimate h_c ≈ 3.7, with an exponent ν ≈ 0.22, which appear unreasonable. The underestimation of the critical disorder seemingly comes from the truncation preprocessing step; indeed, h_c increases as N_c increases. Note that we needed to set aside the L = 22 data for N_c = 64 (otherwise the fit has a large χ²/dof).

Understanding the black box: internal parameters of the network.
The most straightforward way to understand what the NN learned is to look directly at its weights. Figure 5 shows a typical family of weights obtained after training on L = 18 data. The neurons split up into two symmetric groups: (i) [respectively, (ii)] half of the neurons weigh positively (respectively, negatively) the largest probabilities p_i (corresponding to the smallest input indices) up to input index i ≈ 40, while the next inputs are weighed negatively (respectively, positively). We observed that category (i) corresponds to neurons that activate most for an MBL-labeled sample; we thus call them MBL detectors. Likewise, category (ii) is responsible for the detection of ETH features.

FIG. 5. Weights of the first hidden layer (32 neurons) of a typical training instance of a NN trained on L = 18 data.

Figure 5 points towards the relevance of the participation entropies S_q^P for high values of q (as the largest p_i are weighted most strongly by the NN) as a feature to classify the phases and detect the transition. The particular relevance of the IPR (q = 2) was also noted in the support vector machine analysis of a MBL transition in Ref. [26].
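Since the weights point to IPR-like quantities, the participation observables used throughout the text can be sketched as follows; this is a minimal NumPy illustration of the p_i truncation of Sec. III A and of the IPR and S_q^P defined there (function names are our own):

```python
import numpy as np

def truncate(c, n_keep=256):
    """Keep the n_keep largest probabilities p_i = |c_i|^2 of an eigenstate
    |n> = sum_i c_i |i>, sorted in decreasing order (the input format of Sec. III A)."""
    p = np.abs(np.asarray(c)) ** 2
    return np.sort(p)[::-1][:n_keep]

def ipr(p):
    """Inverse participation ratio IPR = sum_i p_i^2 (the q = 2 case)."""
    return float(np.sum(np.asarray(p) ** 2))

def participation_entropy(p, q):
    """S_q^P = log(sum_i p_i^q) / (1 - q) for q != 1 (Shannon limit at q = 1)."""
    p = np.asarray(p)
    if q == 1:
        nz = p[p > 0]
        return float(-np.sum(nz * np.log(nz)))
    return float(np.log(np.sum(p ** q)) / (1.0 - q))

# An ETH-like state spread over N basis states vs a strongly localized one:
N = 1024
p_eth = truncate(np.full(N, N ** -0.5), n_keep=N)    # IPR = 1/N, S_2^P = log N
p_mbl = truncate(np.eye(1, N, 0).ravel(), n_keep=N)  # IPR = 1,  S_2^P = 0
```

The two limiting cases make the classification intuition concrete: a delocalized state has IPR = 1/N (vanishing with system size), while a fully localized state has IPR = 1 regardless of N.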
Discussion.
One limitation of this setup is the possibility that a NN trained on a given L could learn (i.e., reproduce the features of) a certain physical observable different from the one learned by a NN trained at a different L. Indeed, the learned classification model depends, for instance, on the NN capacity (number of layers / hidden neurons) relative to the complexity of the training dataset (which varies from one system size to another). Even more dramatically, Ref. [75] showed that different physical observables are learned depending on the amount of regularization, albeit with support vector machines.

In addition, we find that a NN trained on a given system size in fact captures a model specific to this size. This can be seen, for instance, in a principal component analysis of the network weights (see Fig. 7 and its discussion in Sec. V). It has already been noticed that size-dependent features can indeed be captured [29,75]. It then seems illusory to achieve meaningful transfer learning, such as detecting the transition on L data with a model trained on L′ ≠ L data. In the next section, we present a solution aimed at addressing these two issues.

V. MULTIPLE SYSTEM SIZE TRAINING
Most neural network architectures require input data of fixed size. This comes from the fact that any fully connected layer needs a fixed number of ingoing connections. The chosen formatting of the input data (Sec. III A), with fixed size, allows us to use one unique NN to treat data from different system sizes on an equal footing. Including all system sizes in the training dataset can be viewed as a regularization setup that prevents the detection of size-specific features. We also hope that this will help the neural network capture size-invariant features, i.e., features of the thermodynamic limit, in particular close to criticality.

In the following, we investigate what a neural network trained on a dataset containing system sizes L = 16, 18, 20, 22 all at once can learn, and compare the results to the previous analysis (we refrain from using L = 24 data, as not enough samples are available for training). To do so, we need to work at constant truncation order N_c whatever system size is picked for training. The dataset has the same size as in the previous section, taking one fourth of the samples from the L = 16 data, one fourth from L = 18, and so on.

Figure 6 shows the fraction of MBL-classified samples defined in the previous section and displays similarities with Fig. 4 regarding the existence of fully-ETH and fully-MBL regimes located in the same regions. Nevertheless, a striking asymmetry with respect to single-size training appears: a broadening of the curves in the crossover region.

FIG. 6. Fraction of MBL-classified samples as a function of disorder strength for a NN trained on multiple system sizes all at once and evaluated on different system sizes. Predictions are averaged over 250 disorder realizations per disorder strength (with 100 eigenstates per realization) and 50 training instances. Truncation order is N_c = 256; h_c(L) is defined by f(h_c(L)) = 0.5. Training used ETH-labeled data at h_ETH = 0.25 and MBL-labeled data at three different values of h_MBL; the error bars on the final estimates come from the fitting procedure.

Note that the figure above features nontrivial transfer learning: a neural network trained on L = 16, 18, 20, 22 is asked to classify samples from system sizes L = 14 and 24, for which it has never seen any samples before. This highlights one advantage of this multisize training setup, namely its reduced computational cost: it is reduced by a factor proportional to the number of considered system sizes and the number of retrainings, which can represent a huge saving in computation time.
Finite-size scaling, and dependence on training region.
We perform a finite-size scaling with varying training datasets, which include MBL-labelled samples drawn from different disorder strengths h_MBL; the ETH-labelled data is kept fixed, as we noticed negligible change in the scaling when varying h_ETH. The results are collected in Table II.

TABLE II. Finite-size scaling results with multiple-size training, for different values of the training disorder h_MBL used to label the MBL phase. "Averaged" refers to the method defined in Sec. IV; "Individual" is defined in note [76]. Columns: h_c, ν, and χ²/dof.
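The first step of such a scaling analysis, extracting a size-dependent pseudocritical point from the fraction of MBL-classified samples, can be sketched as follows. This is a minimal sketch assuming the 50% crossing criterion f(h_c(L)) = 0.5; the sigmoid test curve and the function name are illustrative, not the paper's implementation.

```python
import numpy as np

def pseudo_critical_point(h, f, threshold=0.5):
    """Locate the disorder strength where the MBL fraction f(h)
    first crosses the threshold, by linear interpolation between
    the two bracketing grid points."""
    h, f = np.asarray(h), np.asarray(f)
    above = np.nonzero(f >= threshold)[0]
    if len(above) == 0 or above[0] == 0:
        raise ValueError("threshold not bracketed by the data")
    i = above[0]
    # linear interpolation between (h[i-1], f[i-1]) and (h[i], f[i])
    t = (threshold - f[i - 1]) / (f[i] - f[i - 1])
    return h[i - 1] + t * (h[i] - h[i - 1])
```

The set of h_c(L) values obtained this way for each L is then fed to the fitting procedure that yields the final h_c and ν estimates.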
We found that including the predictions obtained by transfer learning at the L = 14 and 24 system sizes considerably improves the results, in the sense that the fitting procedure converges with rather small error bars on h_c and ν. If L = 14 is set aside, the error bars are multiplied by a factor of 4, and the fits do not converge if no transfer learning is done (performing the fit only on L = 16, 18, 20, and 22). Similar estimates of h_c and ν, but with error bars reduced by a factor of 10, are obtained using the individual prediction of the critical point by each network (see the procedure detailed in note [76]).

The finite-size scaling analysis with varying training datasets leads to a somewhat unexpected result: the critical estimates depend on the disorder strength h_MBL used to label the MBL training data, with h_c varying up to values higher than the estimated value, and ν ranging from 0.5 to 1.5. This phenomenon can be rationalized with the following naive argument: samples in the transition region will be classified MBL at lower disorder if the MBL-labelled samples are themselves taken from a region closer to the transition, thus shifting the transition point towards lower critical disorder. One can speculate that this finding actually echoes the nonuniversal multifractal properties of the MBL phase recently noticed in Ref. [33], which is based on the same type of input data. Indeed, one can associate a different multifractal dimension (decreasing with h) to every h_MBL: the h_MBL dependence could then be viewed as a manifestation of the varying multifractality in the MBL phase.

To circumvent this issue, one may for instance include samples from a range of disorder values all at once. However, we noticed that if we provide a training dataset containing MBL samples drawn from several disorder strengths up to h_MBL = 12, the NN tends to capture h_MBL-averaged features of the dataset (see next paragraph), i.e., leading to predictions similar to those of a NN trained at an intermediate value of h_MBL.

Analysis of network internal parameters.
The two previous training setups—single and multiple system size training—give different critical estimates. We now try to understand the source of these differences using a principal component analysis (PCA). The use of PCA has already proven useful in many previous works (see, e.g., Refs. [20,77]). It is used here as a dimensional-reduction procedure and allows us to represent the weights connected to one hidden neuron (an N_c-dimensional vector) in a two-dimensional plane. Figure 7 shows that training on single-L data leads to capturing L-specific features. A hierarchy appears, where the weights corresponding to training at a given system size L are next to the weights for the neighboring sizes, with the multiple-size training weights lying among those of L = 16, 18, 20, and 22.

FIG. 7. PCA representation of the weights learned after training on single system size datasets (Sec. IV) and on multiple system size datasets (Sec. V), supplemented by an L-adversarial component (Sec. VI). Each dot is a weight vector projected on the two principal axes of the PCA analysis (which account for 90% of the total variance). Five training instances are included for each training case.

This shows that the NN does not actually capture size-independent features (which would manifest as a uniform distribution of weights over the L-specific subspaces of weights); rather, in a weaker way, it uncovers averaged features that are shared by all the provided system sizes. To corroborate this point, we trained a NN on system sizes L = 14, 16, 18, and 20 and noticed the same averaging behavior, i.e., this time the NN captured features similar to those captured in the L = 16 and 18 trainings.
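The PCA projection used in this analysis can be reproduced in a few lines. The sketch below is illustrative (plain SVD instead of a dedicated library, with hypothetical variable names): it projects a set of hidden-neuron weight vectors onto their two leading principal axes and reports the explained variance.

```python
import numpy as np

def pca_2d(weights):
    """Project rows of `weights` (one hidden-neuron weight vector
    per row) onto the two principal axes, and return the projected
    coordinates together with the fraction of variance explained."""
    w = np.asarray(weights, dtype=float)
    w = w - w.mean(axis=0)                 # center the data
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    coords = w @ vt[:2].T                  # projection on the two leading axes
    explained = (s[:2] ** 2).sum() / (s ** 2).sum()
    return coords, explained
```

Coloring `coords` by the training system size L of each network then reproduces the kind of clustering diagnostic discussed above.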
Discussion.
The previous section showed that if a neural network is trained on only one system size L, the predictions of the different L-specific NNs are not necessarily comparable, rendering any finite-size-scaling procedure questionable. The multiple-size training setup was expected to produce more reliable predictions, but the results are somewhat disappointing, for several reasons.

First, the obtained error bars on h_c and ν are higher than in the previous case (for the same fitting procedure), pointing to the fact that the definition of h_c(L) may not be the most suitable choice in this setup. One can for example define h_c(L) as the disorder for which all samples are MBL, i.e., when the fraction f first reaches 1: this would affect, not dramatically but in a sensible way, the final estimates of h_c and ν. This difference in treating the ETH and MBL phases could be justified by the various physical observations that the MBL transition displays asymmetries: see for instance the avalanche scenario, which implies that a thermal bubble can more easily destabilize an MBL sample than an MBL bubble does an ETH sample [79], as well as the observation that the critical point is localized [80]. However, this is in our opinion too strong a bias, and it would go against our original goal of providing as minimal a physical input as possible. Second, the transition point greatly depends on the region of the phase diagram used for training (this was also noticed in Refs. [6,40]). This is clearly a limitation of our setup, since one would want the critical parameters to be insensitive to the location of the training data in the phase diagram. Third, the analysis of the weights revealed that this setup leads to the learning of a model averaged over the system sizes provided in the dataset.

FIG. 8. Neural network containing an adversarial component applied on the system size label.
The next section aims at circumventing these limitations, in particular by introducing a constrained setup designed to prevent the NN from capturing size-dependent features or size-averaged behaviors.
VI. SYSTEM SIZE ADVERSARIAL TRAINING
The two previous sections pointed out the difficulty of fighting against the dataset dependence of the NN predictions. The best that we could obtain with the preceding architectures is a NN that has captured averaged features of the training dataset when it contains data from multiple system sizes. We recall that our objective was to use a diverse dataset, exposing the NN to rather different samples labeled identically, in order to later achieve good generalization either to the transition region or even to unseen system sizes (for L = 24, for instance, since it becomes increasingly hard to generate a large amount of training data).

Domain-adversarial neural networks (DANN) have been introduced in Ref. [81] in order to tackle domain adaptation, i.e., the situation where the datasets at training and test/prediction time come from similar but different distributions. The general principle is to learn features that cannot discriminate between the training (source) and test (target) domains. In practice, this is achieved through an adversarial setup that promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. This idea has recently been used in two works [27,77] dealing with phase classification, where the source domain consisted of the extremal regions of the phase diagram and the target domain was the transition region.

Expanding on these ideas, we exploit the specificity of this scheme to force the NN to learn features that are insensitive to the system sizes it has been trained on. In other words, the goal is to use a DANN to learn feature detectors that are L-invariant. To this end, a DANN contains two supplementary components, shown in Fig. 8: a system size classifier and a gradient reversal layer. The latter component is the only nonstandard part of this architecture: it leaves the input unchanged during forward propagation and reverses the gradient by multiplying it by a negative scalar during back-propagation. This results in changing the sign of the gradient of the feature extractor parameters with respect to the size classifier loss. That way, the common feature extractor is adjusted to make the task of the phase classifier as easy as possible while making that of the system size classifier as hard as possible. If the network reaches equilibrium, the selected features are the best suited to identify which phase a sample lies in, while containing no information about which system size it emanates from.

Learning L-invariant features
In this section, we study the predictions of a DANN trained on data from system sizes L = 16, 18, 20, and 22 all at once. In particular, we analyze the effect of the adversarial component compared to the setup of the previous section. The feature extractor part (see Fig. 8) is kept identical to the previous sections (i.e., same hyperparameters, NN structure, etc.). The system size classifier consists of 4 softmax neurons corresponding to each provided system size, with outputs that can be interpreted as the probability of a sample being from any of the given system sizes. The loss function now contains an additional term that takes care of the size labels (second term in the following equation):

\mathcal{L} = \sum_{(x,\,y^h,\,y^L)} \bigg[ \underbrace{-\sum_{j=1}^{2} y^h_j \ln f^h_j(x)}_{\text{phase classifier loss}} \underbrace{\;-\sum_{j=1}^{4} y^L_j \ln f^L_j(x)}_{\text{size classifier loss}} \bigg], \qquad (7)

where y^h (respectively, y^L) is the two-dimensional (respectively, four-dimensional) one-hot vector representing the phase label (respectively, the system size label) of sample x, and f^h (respectively, f^L) is the corresponding two-dimensional (respectively, four-dimensional) softmax output of the phase classifier (respectively, the system size classifier). Because of the adversarial component, the optimization process keeps the size classifier loss at much higher values (in practice, orders of magnitude larger) than the phase classifier loss: the NN will thus be discriminative for the phase classification task and indiscriminate with respect to the shift between the L-data domains.

Adversarial learning is generally considered to be a hard task [82]; for instance, nonconvergence can occur, with oscillations of the optimized parameters. Training is known to be very sensitive to hyperparameter selection, since any unbalance between the two adversaries can lead to overfitting or other unwanted phenomena. In particular, we noticed that the weights of the feature extractor tended to take arbitrarily large values (increasing with training time).
This has the effect of increasing the variance of the predictions from one training instance to another and may also cause overfitting. We therefore found it crucial to add an L2 weight-decay term to the cost function (7), of the form μ‖W‖², with W the internal parameters of the feature extractor. This regularization technique requires, however, a good choice of μ. If μ is too large, the constraint is too strong and the optimization procedure struggles to minimize the classifier losses. If μ is too small, the limitations presented above are not corrected, i.e., the model variance stays high. After fine-tuning, we found that μ = 0.05 gives good results. We checked that the finite-size scaling of the previous section with the same regularization (weight decay with μ = 0.05) gives the same critical values, with no better error bars.

TABLE III. Finite-size scaling results using a DANN approach for multiple-size training, as a function of the training disorder h_MBL used to label the MBL phase. Columns: training data, method, h_c, ν, and χ²/dof.

a. Finite-size scaling.
We perform the finite-size analysis of the NN predictions as before, including the transfer-learning predictions at L = 14 and 24; the resulting estimates of h_c and ν for the different training values of h_MBL are collected in Table III.
Interpretation of the network parameters.
The PCA representation of the DANN weights in Fig. 7 shows that this setup achieves some apparent independence of the model with respect to system size: indeed, the weights are homogeneously distributed over the L-specific weight subspaces.

Similarly to Fig. 5, which showed the weights connecting the input layer to the first hidden layer, Fig. 9 shows the weights connecting the feature extractor to the size classifier. We argue that the L invariance of the features is achieved by reaching the following trivial equilibrium configuration: any feature vector (output of the feature detector) is multiplied by the same weight vector W_L towards each L output of the size classifier, i.e., W_{L=16} = W_{L=18} = ... Due to the softmax normalization of the system size classifier, this leads to a classification in which any sample has equal probability of belonging to any of the provided system sizes.

Discussion.
This setup has proved to alleviate several limitations of the previously considered architectures, namely reducing the training-dataset as well as the training-region dependences. Nevertheless, we found that training a DANN is very sensitive to the hyperparameter choices (regularization parameter μ, etc.) and to the chosen NN structure (depth, etc.), hence requiring very careful calibration, otherwise instabilities can rapidly occur. We also noticed a greater variance of the predictions from one instance to another (see Appendix B).

FIG. 9. Weights connecting the feature extractor to the system-size classifier, plotted against hidden layer neuron index, for one training instance of a DANN trained on L = 16, 18, 20, and 22 data.

VII. DISCUSSION OF RESULTS
The initial goal of this work was to attempt a finite-size study of model (1) using neural networks. Our analysis revealed numerous difficulties: the scaling procedure appeared very sensitive to the neural network hyperparameters (the specific choice of activation function, the addition of dropout or weight decay), as well as to the imposed structure (whether an adversarial component is added or not). In addition, there is no inherent criterion that allows us to discriminate between these different external choices; as a matter of fact, we can consider our analysis as a kind of model exploration (different machines with the same accuracy have different ways of solving the same task) rather than model selection (selecting the machine that achieves the highest accuracy on a given task).

Limitations also arose from the dependence on the particular choice of training dataset: we highlighted that the NN predictions, and ultimately the finite-size scaling, actually depend on the region of the phase diagram used for training. Moreover, when the training dataset includes data from several system sizes, the NN tends to extract average features that do not permit accurate transfer learning. Including a constraint to fight against this behavior (here in the form of an L-invariant adversarial component) improves the situation to a certain extent, at the cost of having to fine-tune extra hyperparameters and thus potentially adding more bias to the final estimates.

These limitations occurred even though we provided the best possible input data, (i) giving directly the wave functions with a controlled compression step and (ii) also in terms of available system size (up to L = 24 in the MBL context). Nevertheless, we find that multi-size training of NNs allows one to grasp consistent finite-size trends based on a limited number of disorder realizations. This points towards one of the NN advantages, namely its reduced computational cost compared to conventional methods. Another interesting point (discussed in Appendix B), which we discovered when investigating the contributions to the variance of the prediction, is that the network output correlates quite well with the entanglement entropy.

The finite-size scaling led to critical values of h_c and ν that are always larger than the conventional estimates. The finite-size scaling of the MBL transition in model (1) (with random disorder) has been shown to be particularly difficult, with the system sizes available from exact diagonalization argued to be too small to probe the correct criticality [62]. We do not find that the machine learning analysis improves this situation, at least within the setup and input data that we chose. In particular, there is no obvious reason to trust the neural networks' final results (again, within the approach chosen in this work) more than the ones reached within the conventional approach. The generic trend that seems to emerge is towards a larger extent of the ETH phase, even though we emphasize that no critical field h_c(L) (obtained for a single system size L) from the NN analysis exceeds the conventional estimate.

Our thorough finite-size study of this phase transition leads to the conclusion that one always has to be aware of the multiple biases that can possibly arise when using neural networks, whose power might be limited to qualitative predictions rather than precise estimations, here for instance finite-size scaling. This is particularly relevant for phase transitions whose nature or universality class is unknown or debated, and/or for which the input data has some limitations (e.g., in terms of the range of sizes accessible in our case).

We finish with suggestions for possible improvements of this situation. In the case of the MBL transition studied here, one can certainly improve the quality of the output by providing more physical knowledge of the transition in the input data (such as when using the entanglement spectrum). Alternatively, one could keep the same generic input data (wave-function coefficients) but use recent results [33] on the finite-size scaling of participation entropies to try to build an improved network architecture as well as to better interpret the outputs. For the more generic case of an unknown phase transition, further work is needed to ascertain the reliability of finite-size scaling within the neural network approach, ideally providing tools to construct and understand a generic finite-size scaling theory of the network prediction. Recent works [83–87] connecting the renormalization group and the neural network construction may be first steps in this direction.

FIG. 10. Weights of the first hidden layer (32 neurons) of two exemplary training instances of a NN trained on L = 18 data with ReLU activation units. Each color corresponds to one hidden neuron, whose weights connect it to the input layer.

FIG. 11. (Left) Histograms of the number of ETH detectors (blue), MBL detectors (orange), and dead neurons (green) per training instance (having 32 hidden neurons), calculated over 50 NN instances. (Right) Fraction of MBL-classified samples for the two NN instances shown in Fig. 10, averaged over 250 disorder realizations per disorder (and 100 eigenstates per realization).

ACKNOWLEDGMENTS
We thank Patrick Huembeli and Alexandre Dauphin for introducing us to the domain adaptation thematics, as well as Evert van Nieuwenburg, Nicolas Macé, and Nicolas Laflorencie for very useful comments. This work is supported by a grant from the Fondation CFM pour la Recherche, and benefited from the support of the project THERMOLOC ANR-16-CE30-0023-02 of the French National Research Agency (ANR) and of the French Programme Investissements d'Avenir under the program ANR-11-IDEX-0002-02, reference ANR-10-LABX-0037-NEXT. We acknowledge PRACE for awarding access to HLRS's Hazel Hen computer based in Stuttgart, Germany under Grant No. 2016153659, as well as the use of HPC resources from CALMIP (Grants No. 2018-P0677 and No. 2019-P0677) and GENCI (Grant No. 2018-A0030500225). Our shift-invert [70] numerical calculations are based on the linear algebra libraries PETSC [88,89], SLEPC [90], and STRUMPACK [91,92]. The neural network calculations are performed with TENSORFLOW [93].
APPENDIX A: HOW RELU ACTIVATION FUNCTIONS INDUCE BIASES IN THE ANALYSIS
ReLU activation functions are broadly used in the machine learning community as well as in many of its applications to physics [3,27,35]. The main motivation comes from the fact that they do not suffer from saturation, contrary to their sigmoid or tanh counterparts. However, it is known that training with ReLU units can lead to dead neurons, i.e., neurons that output zero whatever input value comes in. Although this phenomenon effectively allows learning sparser representations, in our case it drastically reduces the NN capacity, to the point that the underfitting regime is actually reached.

Figure 10 reveals the existence of a third category of neurons: dead neurons, which have zero weights for all incoming connections and which are invisible in Fig. 5 of the main text, where we used ELU activation functions. The appearance of such neurons comes along with great variability from one NN instance to another, some instances having more MBL or ETH detectors than others. This is visible in the histograms in the left panel of Fig. 11, which show that there are on average 37% MBL detectors, 16% ETH detectors, and 47% dead neurons, from statistics over 50 training instances. The right panel of Fig. 11 indeed shows how a variable ratio of MBL/ETH detectors shifts the transition point and therefore adds a bias that is due only to the NN structure.

Furthermore, we find that the addition of dropout to a NN with ReLU activations is very problematic, due to the phenomenon shown in Fig. 12: if one follows the NN output of individual samples during training, dropout induces huge variations. This can be explained by the unbalance between the numbers of ETH and MBL detectors: dropping ETH detectors greatly impacts the classification, since they are on average less numerous than MBL detectors. In addition, one necessarily has to stop training at some step, and these great variations prevent any choice of stopping criterion. With ELU units, on the other hand, we observed on average the same number of MBL and ETH detectors, so that randomly dropping detectors does not impact the classification on average.
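The dead-neuron phenomenon can be demonstrated in a few lines. This is a minimal numerical sketch, not the paper's trained networks: once a ReLU neuron's pre-activation is negative for every input, its output and its gradient both vanish, so gradient descent can never revive it, whereas an ELU still passes a gradient.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

rng = np.random.default_rng(3)
x = np.abs(rng.normal(size=(100, 8)))   # non-negative inputs (like probabilities p_i)
w = -np.abs(rng.normal(size=8))         # weights pushed negative during training
b = -0.1
z = x @ w + b                           # pre-activation is negative for every sample

dead_output = relu(z)                   # identically zero: a "dead" neuron
relu_grad = (z > 0).astype(float)       # zero gradient: the neuron cannot recover
elu_grad = np.where(z > 0, 1.0, np.exp(z))  # ELU keeps a nonzero gradient everywhere
```

This asymmetry between ReLU and ELU is precisely why the dead-neuron category of Fig. 10 appears only with ReLU units.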
FIG. 12. NN output of 200 eigenstates (different colors) plotted against training steps, (left) without dropout and (right) with dropout, for a NN trained on L = 16 data.

FIG. 13. Histograms of y_θ(r) and y(r), as defined in the main text, obtained with 250 disorder realizations. Predictions are obtained from L = 18 data and 50 NNs trained according to the DANN setup.
APPENDIX B: SOURCES OF CLASSIFICATION VARIANCE, PREDICTION FOR INDIVIDUAL EIGENSTATES
As noted in Refs. [35,36], the NN approach allows for a direct low-resolution analysis of the transition, i.e., at the level of eigenstates. In this Appendix, we highlight several interesting features based on the analysis of the predictions for individual eigenstates.

a. Eigenstate-to-eigenstate, sample-to-sample variance.
First, we consider variations of the classification from one disorder realization to another. For a given neural network θ, we study the distribution of classifications across disorder realizations. As done in the main text, we average the classification (0 meaning ETH, 1 MBL) of individual eigenstates sharing the same disorder realization r, denoted y_θ(r). Figure 13 shows a histogram of y_θ(r) for a typical NN θ over 250 disorder realizations. We have checked that this picture is stable for all training instances and for any of the considered setups.

For disorder strengths slightly lower (respectively, higher) than the crossover point, most realizations are classified ETH (respectively, MBL), with intermediate values of y_θ(r) appearing in the crossover region. Figure 13 also shows the histogram of the corresponding training-instance average y(r). We chose results for the setup with the adversarial component, which displays the most variance from one training instance to the other (this is quantified in the next paragraph). The distribution roughly follows the distribution of y_θ(r), meaning that the same physical picture explained above persists for all training instances on average.

FIG. 14. MBL raw confidence as a function of entanglement entropy for 100 individual eigenstates of 100 different disorder realizations, for different system sizes in the multi-size training setup.

FIG. 15. Average classification of one eigenstate per disorder realization over 50 training instances (y_r as defined in the main text) as a function of disorder strength and realization number for (top) a multi-size training setup and (bottom) a DANN setup.

FIG. 16. Average sign, averaged over 200 disorder realizations per disorder, for different system sizes.

TABLE IV. Finite-size scaling results when inputting the signed coefficients c_i to the NN. Columns: setup (multisize or DANN), training data, method, h_c, ν, and χ²/dof.

b. Correlation of individual eigenstate prediction with its entanglement entropy.
The fact that, close to the transition, the network predicts both ETH and MBL eigenstates in the same disorder realization at the same energy density is reminiscent of what was observed in Ref. [94], where a bimodal distribution of the entanglement entropy was observed, also at the individual disorder realization level, close to the transition. This suggests looking at the correlation between the prediction for each eigenstate and its entanglement entropy. This correlation is represented in Fig. 14 for four different sizes in the multi-size training setup. We clearly see that eigenstates with low (high) entanglement are systematically classified as MBL (ETH), maximizing (minimizing) the MBL confidence towards 1 (0). In agreement with Ref. [94], we have checked that an important number of disorder realizations contain at the same time eigenstates with low and high entanglement (and correspondingly high and low MBL confidence). For each system size, there exists a crossover region of intermediate values of the entanglement entropy for which the full range of MBL confidence can be found. This gives rise to the increased variance of the prediction near the transition region, and most certainly to the higher error bars observed there.

c. NN variance.
As pointed out in the main text, we observed the largest model variance in the DANN setup (Sec. VI). To show this, we pick one eigenstate per disorder realization and compute its average classification over 50 training instances, denoted y_r. We do the same for 250 different disorder realizations, and the result is shown in Fig. 15 as a function of h for both multi-size training and L-adversarial training.

For the multi-size training, there is almost no variance: all NNs classify the same eigenstate almost identically, which can be observed in the predictions being most of the time close to 1 or 0 in the top panel of Fig. 15. For the DANN setup, the fluctuations due to disorder realizations are supplemented by fluctuations due to the NN classifications. In effect, Fig. 15 shows that a given eigenstate can sometimes be classified as ETH and as MBL by two different training instances, with averages more often close to intermediate values ∼0.5. Figure 15 also allows us to detect disorder realizations for which the average prediction is markedly different from the others at a given strength of disorder. Quite interestingly, the NN predictions present a certain asymmetry at the transition, with more MBL-classified samples on the ETH side than ETH-classified samples on the MBL side.
APPENDIX C: WORKING WITH AMPLITUDES c_i

For a given eigenstate |n⟩ = Σ_i c_i |i⟩, in the main text we chose to provide the probabilities p_i ≡ |c_i|² as input to the NN. We investigate here whether restoring the signs, i.e., taking directly the amplitudes c_i as input data, would allow for a better estimate of the transition. Indeed, this input contains more information than the p_i input, which could potentially lead to less biased finite-size estimates. Note that, as the Hamiltonian in Eq. (1) is real and symmetric, all c_i are real (up to degeneracies, which can occur only exceptionally due to the random part of the Hamiltonian). As before, we keep only the N_c highest amplitudes c_i (sorted by their absolute value). For illustration, we show in Fig. 16 the average sign, defined as

sign(|n⟩) = |Σ_{i=1}^{N_c} sign(c_i) |c_i|| / Σ_{i=1}^{N_c} |c_i|,   (C1)

for different system sizes. For small disorder, the average sign stays small, close to zero: we indeed expect the eigenfunction coefficients to be Gaussian distributed around zero in the ETH phase. As the disorder strength increases, the average sign grows more rapidly for the smaller system sizes, until it eventually reaches 1 in the high-disorder limit. Again, this limit is very well understood, as each eigenstate is dominated by a single coefficient in the S^z basis in the infinite-disorder limit. We take the multisize training architecture and the same hyperparameters as in Secs. V and VI and attempt a finite-size scaling analysis. The results are summarized in Table IV. Despite inputting in principle more physical information, we find no noticeable improvement in the error bars of the critical estimates, within the setup used.
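Equation (C1) translates directly into code. The sketch below uses synthetic coefficients, illustrative of the two limits discussed above: Gaussian ETH-like coefficients give a small average sign, while a single dominant coefficient gives a value close to 1.

```python
import numpy as np

def average_sign(c, n_c=None):
    """Average sign of Eq. (C1): |sum_i sign(c_i)|c_i|| / sum_i |c_i|,
    restricted to the n_c largest coefficients in absolute value."""
    c = np.asarray(c, dtype=float)
    if n_c is not None:
        c = c[np.argsort(np.abs(c))[::-1][:n_c]]
    return abs(np.sum(np.sign(c) * np.abs(c))) / np.sum(np.abs(c))
```

Since sign(c_i)|c_i| = c_i for real coefficients, the numerator is simply |Σ c_i|; the explicit form above mirrors Eq. (C1) term by term.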
Assaad, and S. Trebst, arXiv:1707.00663.[5] T. Ohtsuki and T. Ohtsuki, J. Phys. Soc. Jpn. , 123706(2016). [6] K. Ch’ng, J. Carrasquilla, R. G. Melko, and E. Khatami, Phys.Rev. X , 031038 (2017).[7] W. Hu, R. R. P. Singh, and R. T. Scalettar, Phys. Rev. E ,062122 (2017).[8] Y.-H. Liu and E. P. L. van Nieuwenburg, Phys. Rev. Lett. ,176401 (2018).[9] M. Matty, Y. Zhang, Z. Papic, and E.-A. Kim, Phys. Rev. B ,155141 (2019).224202-13UGO THÉVENIAUT AND FABIEN ALET PHYSICAL REVIEW B , 224202 (2019)[10] Y. Zhang, A. Mesaros, K. Fujita, S. D. Edkins, M. H. Hamidian,K. Ch’ng, H. Eisaki, S. Uchida, J. C. S. Davis, E. Khatami, andE.-A. Kim, Nature (London) , 484 (2019).[11] S. Ghosh, M. Matty, R. Baumbach, E. D. Bauer, K. A. Modic,A. Shekhter, J. A. Mydosh, E.-A. Kim, and B. J. Ramshaw,arXiv:1903.00552.[12] G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko,and G. Carleo, Nat. Phys. , 447 (2018).[13] G. Carleo and M. Troyer, Science , 602 (2017).[14] Y. Nomura, A. S. Darmawan, Y. Yamaji, and M. Imada, Phys.Rev. B , 205152 (2017).[15] Z. Cai and J. Liu, Phys. Rev. B , 035116 (2018).[16] I. Glasser, N. Pancotti, M. August, I. D. Rodriguez, and J. I.Cirac, Phys. Rev. X , 011006 (2018).[17] H. Saito, J. Phys. Soc. Jpn. , 093001 (2017).[18] X. Liang, W.-Y. Liu, P.-Z. Lin, G.-C. Guo, Y.-S. Zhang, and L.He, Phys. Rev. B , 104426 (2018).[19] J. Liu, Y. Qi, Z. Y. Meng, and L. Fu, Phys. Rev. B , 041101(R)(2017).[20] L. Wang, Phys. Rev. E , 051301(R) (2017).[21] X. Y. Xu, Y. Qi, J. Liu, L. Fu, and Z. Y. Meng, Phys. Rev. B ,041119(R) (2017).[22] L. Zdeborová, Nat. Phys. , 420 (2017).[23] S. Das Sarma, D.-L. Deng, and L.-M. Duan, Phys. Today (3),48 (2019).[24] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N.Tishby, L. Vogt-Maranto, and L. Zdeborová, arXiv:1903.10563.[25] Z. Li, M. Luo, and X. Wan, Phys. Rev. B , 075418 (2019).[26] W. Zhang, L. Wang, and Z. Wang, Phys. Rev. B , 054208(2019).[27] P. Huembeli, A. Dauphin, P. Wittek, and C. Gogolin, Phys. 
Rev.B , 104106 (2019).[28] S. Efthymiou, M. J. S. Beach, and R. G. Melko, Phys. Rev. B , 075113 (2019).[29] M. J. S. Beach, A. Golubeva, and R. G. Melko, Phys. Rev. B ,045207 (2018).[30] P. Suchsland and S. Wessel, Phys. Rev. B , 174435 (2018).[31] Y. Zhang, R. G. Melko, and E.-A. Kim, Phys. Rev. B , 245119(2017).[32] D. J. Luitz, N. Laflorencie, and F. Alet, Phys. Rev. B ,081103(R) (2015).[33] N. Macé, F. Alet, and N. Laflorencie, Phys. Rev. Lett. ,180601 (2019).[34] J. A. Kjäll, J. H. Bardarson, and F. Pollmann, Phys. Rev. Lett. , 107204 (2014).[35] F. Schindler, N. Regnault, and T. Neupert, Phys. Rev. B ,245134 (2017).[36] J. Venderley, V. Khemani, and E.-A. Kim, Phys. Rev. Lett. ,257204 (2018).[37] Y.-T. Hsu, X. Li, D.-L. Deng, and S. Das Sarma, Phys. Rev.Lett. , 245701 (2018).[38] H. Théveniaut, Z. Lan, and F. Alet, arXiv:1902.04091.[39] S. Durr and S. Chakravarty, Phys. Rev. B , 075102(2019).[40] E. van Nieuwenburg, E. Bairey, and G. Refael, Phys. Rev. B ,060301(R) (2018).[41] E. V. H. Doggen, F. Schindler, K. S. Tikhonov, A. D. Mirlin,T. Neupert, D. G. Polyakov, and I. V. Gornyi, Phys. Rev. B ,174202 (2018). [42] M. Schreiber, S. S. Hodgman, P. Bordia, H. P. Lüschen, M. H.Fischer, R. Vosk, E. Altman, U. Schneider, and I. Bloch,Science , 842 (2015).[43] J. Smith, A. Lee, P. Richerme, B. Neyenhuis, P. W. Hess, P.Hauke, M. Heyl, D. A. Huse, and C. Monroe, Nat. Phys. ,907 (2016).[44] J.-Y. Choi, S. Hild, J. Zeiher, P. Schauß, A. Rubio-Abadal,T. Yefsah, V. Khemani, D. A. Huse, I. Bloch, and C. Gross,Science , 1547 (2016).[45] H. P. Lüschen, P. Bordia, S. Scherg, F. Alet, E. Altman, U.Schneider, and I. Bloch, Phys. Rev. Lett. , 260401 (2017).[46] D. A. Abanin, E. Altman, I. Bloch, and M. Serbyn, Rev. Mod.Phys. , 021001 (2019).[47] F. Alet and N. Laflorencie, C. R. Phys. , 498 (2018).[48] D. A. Abanin and Z. Papi´c, Ann. Phys. , 1700169 (2017).[49] R. Nandkishore and D. A. Huse, Ann. Rev. Cond. Matt. , 15(2015).[50] E. Altman and R. Vosk, Ann. Rev. Cond. Matt. 
, 383 (2015).
[51] J. M. Deutsch, Phys. Rev. A, 2046 (1991).
[52] M. Srednicki, Phys. Rev. E, 888 (1994).
[53] M. Žnidarič, T. Prosen, and P. Prelovšek, Phys. Rev. B, 064426 (2008).
[54] A. Pal and D. A. Huse, Phys. Rev. B, 174411 (2010).
[55] P. T. Dumitrescu, R. Vasseur, and A. C. Potter, Phys. Rev. Lett., 110604 (2017).
[56] R. Vosk, D. A. Huse, and E. Altman, Phys. Rev. X, 031032 (2015).
[57] A. C. Potter, R. Vasseur, and S. A. Parameswaran, Phys. Rev. X, 031033 (2015).
[58] A. Goremykina, R. Vasseur, and M. Serbyn, Phys. Rev. Lett., 040601 (2019).
[59] P. T. Dumitrescu, A. Goremykina, S. A. Parameswaran, M. Serbyn, and R. Vasseur, Phys. Rev. B, 094205 (2019).
[60] A. Morningstar and D. A. Huse, Phys. Rev. B, 224205 (2019).
[61] A. Chandran, C. R. Laumann, and V. Oganesyan, arXiv:1509.04285.
[62] V. Khemani, D. N. Sheng, and D. A. Huse, Phys. Rev. Lett., 075702 (2017).
[63] A. D. Luca and A. Scardicchio, EPL, 37003 (2013).
[64] D. A. Huse, R. Nandkishore, and V. Oganesyan, Phys. Rev. B, 174202 (2014).
[65] M. Serbyn, Z. Papić, and D. A. Abanin, Phys. Rev. Lett., 127201 (2013).
[66] L. Rademaker, M. Ortuño, and A. M. Somoza, Ann. Phys.
[68] in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15 (Journal of Machine Learning Research, 2015), pp. 448–456.
[69] Down-sampling a list of p_i means keeping only every M-th element. For instance, if we take a list of the 1024 highest probabilities p_i, down-sampling it by a factor of 4 means that we consider the list p_i containing 256 elements.
[70] F. Pietracaprina, N. Macé, D. J. Luitz, and F. Alet, SciPost Phys., 045 (2018).
[71] P. Mehta, M. Bukov, C.-H. Wang, A. G. R. Day, C. Richardson, C. K. Fisher, and D. J. Schwab, Phys. Rep., 1 (2019).
[72] D. P. Kingma and J. Ba, arXiv:1412.6980.
[73] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, arXiv:1511.07289.
[74] N. Srivastava, G.
Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, J. Mach. Learn. Res., 1929 (2014).
[75] P. Ponte and R. G. Melko, Phys. Rev. B, 205146 (2017).
[76] We apply the procedure of Sec. IV to each neural network separately, then average the critical estimates and calculate appropriate error bars. In other words, having a collection of quantities f^θ for each neural network θ (f^θ is the MBL fraction averaged over eigenstates and disorder realizations predicted by neural network θ), these give rise to a collection of h_c^θ(L) defined by f^θ(h_c^θ(L)) = 0.5, and hence to estimates h_c^θ, ν_c^θ with error bars for each of them. The final estimates h_c and ν_c shown in Table II are simply the average of these NN estimates, and the final error bar is calculated from the individual errors δh_c^θ as (δh_c)² = Σ_θ (δh_c^θ)² / N_θ². Note that the error bars can be reduced as much as desired at the cost of training more instances.
[77] P. Huembeli, A. Dauphin, and P. Wittek, Phys. Rev. B, 134109 (2018).
[78] L. van der Maaten and G. Hinton, J. Mach. Learn. Res., 2579 (2008).
[79] D. J. Luitz, F. Huveneers, and W. De Roeck, Phys. Rev. Lett., 150602 (2017).
[80] T. Thiery, F. Huveneers, M. Müller, and W. De Roeck, Phys. Rev. Lett., 140601 (2018).
[81] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky, J. Mach. Learn. Res., 1 (2016).
[82] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, arXiv:1711.10337.
[83] P. Mehta and D. J. Schwab, arXiv:1410.3831.
[84] M. Koch-Janusz and Z. Ringel, Nat. Phys., 578 (2018).
[85] S.-H. Li and L. Wang, Phys. Rev. Lett., 260601 (2018).
[86] S. Iso, S. Shiba, and S. Yokoo, Phys. Rev. E, 053304 (2018).
[87] P. M. Lenggenhager, D. E. Gökmen, Z. Ringel, S. D. Huber, and M. Koch-Janusz, arXiv:1809.09632.
[88] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, in Modern Software Tools in Scientific Computing, edited by E. Arge, A. M. Bruaset, and H. P.
Langtangen (Birkhäuser Press, Cambridge, 1997), pp. 163–202.
[89] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, PETSc Users Manual, Tech. Rep. ANL-95/11 - Revision 3.8 (Argonne National Laboratory, 2017).
[90] V. Hernandez, J. E. Roman, and V. Vidal, ACM Trans. Math. Software, 351 (2005).
[91] P. Ghysels, X. Li, F. Rouet, S. Williams, and A. Napov, SIAM J. Sci. Comput., S358 (2016).
[92] P. Ghysels, X. S. Li, C. Gorman, and F. H. Rouet, in (IEEE, Piscataway, NJ, 2017), pp. 897–906.
[93] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: a system for large-scale machine learning, in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16 (ACM, New York, 2016), pp. 265–283.
[94] X. Yu, D. J. Luitz, and B. K. Clark, Phys. Rev. B 94
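The down-sampling operation described in endnote [69] can be sketched as follows; this is an illustrative reading of the endnote, not code from the paper (the function name `downsample` is ours):

```python
def downsample(p, M):
    """Keep only every M-th element of the list p, as in endnote [69]."""
    return p[::M]

# Example from the endnote: a list of the 1024 highest probabilities p_i,
# down-sampled by a factor of 4, yields a list of 256 elements.
p = [1.0 / (i + 1) for i in range(1024)]  # stand-in for sorted probabilities
p_down = downsample(p, 4)
print(len(p_down))  # -> 256
```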
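The averaging of per-network estimates in endnote [76] amounts to standard error propagation for a mean of independent estimates: the final h_c is the mean of the h_c^θ, and (δh_c)² = Σ_θ (δh_c^θ)² / N_θ². A minimal sketch, with hypothetical input values (not the paper's data):

```python
import math

def combine_estimates(h_c_list, dh_c_list):
    """Average per-network critical points h_c^theta and propagate the
    individual error bars as (delta h_c)^2 = sum (delta h_c^theta)^2 / N^2."""
    n = len(h_c_list)
    h_c = sum(h_c_list) / n
    dh_c = math.sqrt(sum(dh ** 2 for dh in dh_c_list)) / n
    return h_c, dh_c

# Four hypothetical networks with equal error bars: the combined error
# shrinks as 1/sqrt(N), i.e. 0.2 / 2 = 0.1 here.
h, dh = combine_estimates([3.7, 3.9, 3.8, 3.6], [0.2, 0.2, 0.2, 0.2])
print(h, dh)  # -> 3.75 0.1
```

This also illustrates the endnote's closing remark: training more instances (larger N_θ) shrinks the final error bar.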