Interpretable and unsupervised phase classification
Julian Arnold, Frank Schäfer, Martin Žonda, and Axel U. J. Lode
Department of Physics, University of Basel, Klingelbergstrasse 82, 4056 Basel, Switzerland
Institute of Physics, Albert-Ludwigs-Universität Freiburg, Hermann-Herder-Strasse 3, 79104 Freiburg im Breisgau, Germany
(Dated: October 13, 2020)

Fully automated classification methods that yield direct physical insights into phase diagrams are of current interest. Here, we demonstrate an unsupervised machine learning method for phase classification which is rendered interpretable via an analytical derivation of its optimal predictions and allows for an automated construction scheme for order parameters. Based on these findings, we propose and apply an alternative, physically-motivated, data-driven scheme which relies on the difference between mean input features. This mean-based method is computationally cheap and directly interpretable. As an example, we consider the physically rich ground-state phase diagram of the spinless Falicov-Kimball model.
Phase diagrams and phase transitions are of paramount importance to physics [1–3]. While typical many-body systems have a large number of degrees of freedom, their phases are usually characterized by a small set of physical quantities like response functions or order parameters. However, the identification of phases and their order parameters is often a complex problem involving a large state space [4, 5]. Machine learning methods are apt for this task [3, 6–14] as they can deal with large data sets and efficiently extract information from them. Ideally, such machine learning methods should not require any a priori knowledge about the phases, i.e., the methods should be unsupervised [7, 9, 15–29]. Yet, they should also allow for straightforward physical insight into the character of the phases. Significant progress has been made recently [20, 21, 25, 26], but some open issues with interpretability remain. Thus, unsupervised and interpretable phase classification remains a challenging, but highly rewarding task.

A good example of both progress in the field and relevant issues regarding interpretability is the unsupervised method introduced in Ref. [18]. This approach is based on a predictive model trained to infer the parameters of a physical system from input data – obtained by experimental measurements or numerical simulations – that characterize the system's state. In the following, we refer to this approach as the prediction-based method. The predictions for the system parameters in the prediction-based method change most strongly near phase boundaries. Hence, the vector-field divergence of the deviations of the predicted system parameters from their true values serves as an indicator (label I in Fig. 1) of phase boundaries.

The prediction-based method was hitherto successfully applied to symmetry-breaking [18], driven-dissipative [18], quantum [19], and topological phase transitions [19, 30] in various systems. The prediction-based method requires a predictive model with sufficient expressive power [31, 32] to resolve different phases. The resulting phase classifications can therefore be hard to interpret if highly expressive models, such as deep neural networks (DNNs) [32], have to be used [20, 26]. Additionally, the training of DNNs is computationally demanding. For complicated phase diagrams with a large number of phases, the applicability of the prediction-based method remains to be demonstrated.

Herein, we render the prediction-based method fully interpretable by deriving its optimal predictions. To devise local order parameters for the predicted phases, we employ linear models [25, 26, 33, 34] to infer the system parameters. Such local order parameters distinguish neighboring phases. As the key result of this Letter, we demonstrate a physically motivated, general, data-driven, unsupervised phase classification approach which relies on the difference between mean input features as an indicator for phase transitions (Fig. 1). In the following, we refer to this approach as the mean-based method. The mean-based method eliminates the need for a predictive model, is computationally cheap, and is directly interpretable.

FIG. 1. Our workflow to predict a phase diagram with indicators I for phase transitions. A set of samples {S_i} is generated for fixed system parameters p_i. Based on these samples, a scalar indicator for phase transitions, I(p_i), is calculated. This indicator highlights the boundaries (red) between phases (grey). Different unsupervised phase classification schemes are established via different indicators.
As a physical system, we consider the two-dimensional spinless Falicov-Kimball model (FKM) [35–37]. Its ground-state phase diagram features a large number of different phases, e.g., charge stripes or various phase separations [38–40]. The features of these phases play an important role in the investigation of numerous physical phenomena, e.g., metal-insulator and valence transitions [41–45], pattern formations in ultracold atoms in optical lattices [46–49], localization and correlations [50–56], or various nonequilibrium phenomena [57–64]. Hitherto, the classification of ground-state phases in the FKM was a manual and – due to the richness of the phase diagram [38–40] – lengthy and cumbersome task. The complexity of the FKM phase diagram makes it a challenging example for unsupervised and interpretable phase classification methods.

FIG. 2. (a) Sketch of the ground-state phase diagram of the spinless FKM. Red-dashed lines highlight the boundaries of the phases with (1) segregated, (2) diagonal, and (3) axial orderings. For each ordering (1)–(3), an example of a typical ground-state configuration w (L = 20) is shown on top. Here, the absence (w_i = 0) and presence (w_i = 1) of an f particle at lattice site i is denoted by a white or black square, respectively. (b)-(c) ∇_p · δp [Eq. (2)] based on the predictions of a DNN trained using |F| as input in the (b) noise-free and (c) noisy case. (d) Illustration of the correlation functions that measure square (κ^sq_n), axial (κ^ax_n), and diagonal (κ^di_n) correlations at a distance n from the origin [cf. Eq. (4) and red and blue squares]. (e)-(f) Correlation indicator ∆κ̄ [Eq. (5)] in the (e) noise-free and (f) noisy case. Both ∇_p · δp and ∆κ̄ serve as indicators for phase transitions (Fig. 1). The dashed line in (c),(f) marks the cut along ρ = 63/400 ≈ 0.16 analyzed in Fig. 3. Example configurations (1)–(9) for some of the largest predicted regions of stability are shown on top.

The Hamiltonian of the spinless FKM is

H = −t Σ_{⟨ij⟩} (d†_i d_j + d†_j d_i) + U Σ_i d†_i d_i f†_i f_i.  (1)

Here, t is the hopping integral (the energy unit throughout this work), U is the on-site Coulomb interaction strength, and f†_i (f_i) and d†_i (d_i) are the creation (annihilation) operators of heavy (f) and light (d) fermions at lattice site i. The number operator n_{f,i} = f†_i f_i commutes with the Hamiltonian for all i; we can replace it by its eigenvalues w_i ∈ {0, 1}. The ground state is thus determined by the classical f-particle configuration w = {w_i} that minimizes the system energy. We focus on the "neutral" case [38], characterized by an equal density of heavy and light particles, ρ ≡ N_f/L² = N_d/L². Here, N_f (N_d) is the total number of heavy (light) particles and L = 20 – which we fix throughout this Letter – is the linear size of the square two-dimensional lattice with periodic boundary conditions (plane symmetry group: p4m [65]).

Figure 2(a) shows a sketch of the expected phase diagram in the two-dimensional parameter space [38]. It highlights the regions of stability of three main types of orderings, namely, (1) segregated, (2) diagonal, and (3) axial orderings. A multitude of other phases with smaller stability regions are expected to be present in the full diagram [38, 39].

We determine the ground-state configuration w approximately for a given p ≡ (U, ρ) using an adaptive simulated annealing algorithm (Sec. S1 in Supplemental Material (SM) [66]), where ρ ranges from 1/L² to half-filling (∆ρ = 1/L²) and U ranges from 1 to 8 (∆U = 0.1). For each p we performed up to 64 independent simulations. Because simulated annealing does not always converge to the ground state for large systems, we investigate two cases: a "noise-free" case where the best estimate w is taken as the ground state, and a "noisy" case where we take into account the 10 configurations with the lowest energies at each p. The latter case is important for checking the robustness of our methods and is particularly relevant for experiments, where thermal fluctuations are inevitable.

We start the analysis of the phase diagram using the prediction-based method with DNNs (Sec. S2 in SM [66]). Since the predictions p̂ ≡ (Û, ρ̂) are most susceptible near the phase boundaries, maxima in the vector-field divergence of δp ≡ p̂ − p given by

∇_p · δp = ∂δU/∂U + ∂δρ/∂ρ  (2)

serve as an indicator I(p) (Fig. 1) of phase boundaries.
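For concreteness, once the predictions are tabulated on the parameter grid, the indicator in Eq. (2) amounts to a few lines of array arithmetic. The following minimal sketch (our illustration, not the authors' released code [70]; function and variable names are ours) evaluates the divergence with symmetric difference quotients, cf. Eq. (S7) in the SM:

```python
import numpy as np

def divergence_indicator(U_hat, rho_hat, U_grid, rho_grid):
    """U_hat, rho_hat: 2D arrays of model predictions on a grid where
    axis 0 runs over U_grid and axis 1 over rho_grid."""
    dU = U_hat - U_grid[:, None]        # deviations: delta U = U_hat - U
    drho = rho_hat - rho_grid[None, :]  # deviations: delta rho = rho_hat - rho
    # central (symmetric) difference quotients along each parameter axis
    ddU_dU = np.gradient(dU, U_grid, axis=0)
    ddrho_drho = np.gradient(drho, rho_grid, axis=1)
    return ddU_dU + ddrho_drho          # maxima indicate phase boundaries
```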
We train the DNN to predict p = (U, ρ) using as input the magnitude of the two-dimensional discrete Fourier transform |F|. Figures 2(b) and (c) show the obtained phase diagram in the noise-free and the noisy case, respectively. The usage of |F| instead of w results in a shorter training time because data augmentation by lattice translations is not necessary. Moreover, it yields an improved signal-to-noise ratio for ∇_p · δp, because the peak in |F| at k = (0, 0) corresponds to N_f, i.e., ρ is directly fed into the DNN. Consequently, the vector field δp exhibits a horizontal structure (Sec. S6 in SM [66]), meaning that ρ is predicted with near-perfect accuracy in both cases. The horizontal structure of δp implies that maxima in ∇_p · δp indicate phase transitions along U at fixed ρ. The (U, ρ) parameter space can thus be analyzed with cuts along U – we will do this later on.

The largest connected regions with a negative divergence signal in Fig. 2(b) coincide with the three main regions within the sketched phase diagram displayed in Fig. 2(a), and their character can be confirmed by a simple order parameter analysis (Sec. S3 in SM [66]). Thus, the prediction-based method correctly identifies the expected main characteristics of the phase diagram. However, the phase boundaries are not reproduced with a large contrast and it is not easy to identify stability regions besides the ones of the main orderings [(1)–(3) in Fig. 2(a)]. To interpret these results, we derive the optimal prediction for an input x of a model trained to minimize a mean-square-error loss (derivation in Sec. S2 in SM [66]):

p̂_opt(x) = Σ_i P_i(x) p_i / Σ_i P_i(x).  (3)

Here, the sum runs over all sampled points {p_i} in parameter space. The probability of drawing the input x at p_i is governed by the distribution P_i. By considering identical, non-zero values of P_i(x) within a particular region of parameter space and zero outside, we get the optimal predictions for the noise-free case. The vector-field divergence ∇_p · δp [Fig. 2(b)] matches the one of an optimal predictive model. We, therefore, successfully rendered the phase classification of the prediction-based method interpretable. However, we still lack physical insights into the character of the predicted phases.

An important observation concerning the noise-free case is that the method predicts a phase transition whenever neighboring configurations in U (at ρ = const.) are not related by transformations of p4m. The optimal model predictions for all points within such a phase are placed at its center of mass. This results in a signal of ∇_p · δp = −1 within a phase [cf. Eq. (S16) in SM [66]], whereas the value of ∇_p · δp at two points p that constitute a phase boundary serves as a measure of the mean extent of the two corresponding phases (along U). Consequently, the large extent of the segregated phase in parameter space is the cause of the large, isolated maxima in ∇_p · δp within it, Fig. 2(b). Such isolated points are physically meaningless.
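To make Eq. (3) concrete, the optimal prediction can be evaluated directly from empirical estimates of the distributions P_i; no neural network is needed. A minimal sketch (ours, with illustrative names; not from the authors' repository [70]):

```python
import numpy as np

def optimal_prediction(x, samples, params):
    """Evaluate Eq. (3) for one input x.
    samples[i]: list of inputs drawn at parameter point params[i];
    P_i(x) is estimated by the relative frequency of x among them."""
    P = np.array([np.mean([np.array_equal(x, s) for s in S]) for S in samples])
    if P.sum() == 0.0:
        raise ValueError("x was never sampled at any parameter point")
    return P @ np.asarray(params, dtype=float) / P.sum()
```

In the noise-free case, P_i(x) is constant within the phase containing x and zero elsewhere, so the prediction reduces to the phase's center of mass, reproducing the ∇_p · δp = −1 plateau discussed above.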
The noisy case can be understood from a single line-scan. At ρ = 63/400 [dashed line in Fig. 2(c)] a transition from a non-segregated to a segregated ordering occurs. Figure 3(a),(d) shows that the predictions Û and the corresponding divergence ∂δU/∂U obtained with a DNN indicate the corresponding phase boundary. Finite-sample statistics cause significant fluctuations in the distributions P_i(|F|) and varying predictions Û within the segregated phase. The fluctuations in P_i can yield divergence signals close to zero that, again, correspond to misleading predictions of phase boundaries within the segregated phase.

Nevertheless, we can now use the predictions Û for the definition of automatically generated local order parameters. The example in Fig. 3(a) exhibits the largest change in Û at the transition from non-segregated to segregated orderings; Û increases only slowly within the segregated phase (U ≳ 2). The prediction Û can be qualitatively reproduced by training a linear model as opposed to a DNN [Fig. 3(b),(e)]. The local order parameter Û based on the linear model is directly interpretable through its weights and biases. This approach also works in the noise-free case and with other inputs (Sec. S6 in SM [66]). We thus demonstrate that the construction of local order parameters can be automated by training a linear model for each individual phase transition which was identified prior using a DNN with the prediction-based method. Eventually, the DNN-based model can be replaced by the corresponding set of linear models, which is conceptually related to training local surrogate models [33].

FIG. 3. Analysis of the transition from non-segregated to segregated orderings occurring at U ≈ 2 along the line-scan from U_min = 1 to U_max = 8 at fixed ρ = 63/400 ≈ 0.16. Predictions Û and corresponding divergence ∂δU/∂U of (a),(d) a DNN and (b),(e) a linear model, as well as (c),(f) the indicator ∆x̄ [Eq. (6)] based on |F|. ⟨∆x̄⟩ denotes the average difference signal over the entire line-scan and is subtracted to account for noise arising due to finite sample statistics. The degree of red in (a)–(c) denotes an increasingly positive value of the respective indicator for phase transitions; (d)–(f) configurations visualized using the same color scale as for the points in (a)–(c), respectively.
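As an illustration of this automated construction, one can fit a linear model on data from a window around a predicted transition and read off Û as a local order parameter. The sketch below uses scikit-learn's Ridge regression, as in Sec. S2.2 of the SM; the feature choice, window, and regularization rate are our assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

# X: (n_samples, n_features) inputs, e.g., flattened |F| or correlations kappa,
# U: (n_samples,) system parameter at which each input was generated,
# both restricted to a window around one predicted phase boundary.
def local_order_parameter(X, U, reg_rate=1e-4):
    model = Ridge(alpha=reg_rate).fit(X, U)  # minimizes a loss of the form of Eq. (S9)
    # the fitted weights directly expose which features drive the prediction
    return model.predict, model.coef_, model.intercept_
```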
Importantly, the form of the optimal model in the prediction-based method paves the way for a class of computationally cheap algorithms for unsupervised phase classification without any predictive model. Here, we focus on the mean-based method (other methods in Sec. S4 in SM [66]). The optimal model predictions reveal that the corresponding predicted phase diagram can be reproduced by detecting changes in observables derived from w which are invariant under transformations of p4m. Ideally, the observable should not be very sensitive to small changes in w within a stability region.

A suitable, physically-motivated choice are correlation functions that measure the order of a given configuration:

κ^ξ_n(U, ρ) = (1/(m^ξ_n L²)) Σ_{i=1}^{L²} Σ_{j ∈ {j^ξ_n}} (2w_i − 1)(2w_j − 1),  (4)

where m^ξ_n is the number of constituents in the set {j^ξ_n}, which contains the lattice points matching three types of orders that measure square (ξ = sq), axial (ξ = ax), and diagonal (ξ = di) correlations over n lattice sites [illustration in Fig. 2(d) and details in Sec. S3 in SM [66]]. With these correlation functions, we define the following correlation indicator for the mean-based method:

∆κ̄(U, ρ) ≡ ‖κ̄(U + ∆U, ρ) − κ̄(U − ∆U, ρ)‖,  (5)

at each point p = (U, ρ), where κ̄ = [κ̄^sq_1, ..., κ̄^sq_{L/2}, κ̄^ax_1, ..., κ̄^ax_{L/2}, κ̄^di_1, ..., κ̄^di_{L/2}]^T. Here, the bar indicates the average over all inputs at a given point p, if multiple inputs are considered. The indicator ∆κ̄ measures the magnitude of the change of order quantified by κ^ξ_n.

Our results for both the noisy and noise-free cases demonstrate that the mean-based method with the indicator ∆κ̄ reveals the phase diagram more clearly than ∇_p · δp, compare Figs. 2(e),(f) and 2(b),(c). The indicator ∆κ̄ reproduces the main characteristics of the FKM phase diagram [Fig. 2(a)]: ∆κ̄ almost vanishes within the stability region of segregated orderings, in the presence or absence of noise [Fig. 2(e),(f)], and marks all phase boundaries of Fig. 2(a). Moreover, we obtain a detailed, physical subdivision of the phase diagram – see the identified orderings in Fig. 2(top) and labels (1)–(9) in Fig. 2(e). We infer that the mean-based method with the correlation indicator ∆κ̄ in Eq. (5) is an excellent tool to detect phase boundaries.

We now ask whether the mean-based method can be applied to the phase classification problem without a specific, physically-motivated input. To this end, we extend the mean-based method to general inputs by defining a generic indicator:

∆x̄(p) ≡ ‖x̄(U + ∆U, ρ) − x̄(U − ∆U, ρ)‖.  (6)

Here, x̄(p_i) = Σ_j P_i(x_j) x_j ≈ (1/N) Σ_j x_j(p_i) denotes the average input at a point p_i over all corresponding N inputs x_j (Sec. S5 in SM [66]).

Since the generic indicator ∆x̄ detects phase transitions, Eq. (6) establishes a general, data-driven scheme without a predictive model which is applicable in both the noise-free and noisy case. In any such data-driven approach the used input x crucially affects the performance of the phase classification; Fig. 2(e),(f) shows that the set of correlation functions κ is an appropriate choice in the case of the FKM (results with different inputs in Sec. S6 in SM [66]).

Another suitable choice of input for the mean-based method are the Fourier-transformed configurations |F| [line-scan along U in Fig. 3(c),(f)]. Even in this case the difference signal is a good indicator for the phase transition at U ≈ 2.
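The mean-based indicator requires only feature averages at neighboring grid points. A minimal sketch of Eq. (6) for a line-scan along U (our illustrative code, not the released implementation [70]):

```python
import numpy as np

def mean_based_indicator(mean_features):
    """mean_features: array of shape (n_U, n_features) holding the average
    input at each point of a line-scan along U (at fixed rho).
    Returns the generic indicator of Eq. (6) at the interior grid points."""
    diff = mean_features[2:] - mean_features[:-2]  # x(U+dU) - x(U-dU), averaged
    return np.linalg.norm(diff, axis=1)            # peaks mark phase transitions
```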
This underpins the generality and robustness of the mean-based method and shows its possible applicability to other models beyond the FKM, where the inputs are generally different.

We stress that the indicators of phase transitions in the mean-based method [Eq. (6)] and the prediction-based method [Eq. (2)] differ fundamentally. They constitute two distinct approaches to characterize changes in the underlying distributions P(x). However, the mean-based method has some clear advantages: it is computationally cheap and allows for direct physical insights. For example, by calculating an indicator, Eq. (6), based on each individual element of κ or |F|, one can directly find the elements which have the greatest impact on the overall indicator. This further characterizes each phase transition and simplifies the subsequent analysis.

In conclusion, we have rendered the prediction-based method fully interpretable with a derivation of its optimal model predictions and the automatic generation of local order parameters using linear surrogate models. Moreover, we have presented a mean-based method that works outstandingly well as an unsupervised phase classification approach for various inputs and in the presence of noise. We infer that applications of our mean-based method to arbitrary phase diagrams featuring, e.g., quantum or topological phase transitions are feasible. Specifically, applications to quantum-classical systems such as the FKM and its numerous generalizations [40, 57, 67–69] are now straightforward. The success of the mean-based method suggests extensions to unsupervised phase classification methods based on higher-order moments or alternative distance measures.

We would like to thank Niels Lörch, Eliska Greplova, Michael Thoss, and Christoph Bruder for inspiring discussions. J.A. and F.S. acknowledge financial support from the Swiss National Science Foundation (SNSF) and the NCCR Quantum Science and Technology. A.U.J.L. acknowledges financial support by the Austrian Science Foundation (FWF) under grant No. P-32033-N32. Computation time on the Hawk cluster at the HLRS Stuttgart and at the sciCORE (scicore.unibas.ch) scientific computing core facility at the University of Basel, as well as support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grants no. INST 40/467-1 FUGG (JUSTUS cluster), INST 39/963-1 FUGG (bwForCluster NEMO), and INST 37/935-1 FUGG (bwForCluster BinAC), is gratefully acknowledged.

[1] S. Sachdev, Quantum Phase Transitions (Cambridge University Press, 2011).
[2] N. Goldenfeld, Lectures On Phase Transitions And The Renormalization Group (CRC Press, 2018).
[3] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Rev. Mod. Phys. 91, 045002 (2019).
[4] J. Sethna, Statistical Mechanics: Entropy, Order Parameters, and Complexity (Oxford University Press, 2006).
[5] P. M. Chaikin and T. C. Lubensky, Principles of Condensed Matter Physics (Cambridge University Press, 1995).
[6] J. Carrasquilla and R. G. Melko, Nat. Phys. 13, 431 (2017).
[7] E. P. van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Nat. Phys. 13, 435 (2017).
[8] K. Ch'ng, J. Carrasquilla, R. G. Melko, and E. Khatami, Phys. Rev. X 7, 031038 (2017).
[9] L. Wang, Phys. Rev. B 94, 195105 (2016).
[10] B. S. Rem, N. Käming, M. Tarnowski, L. Asteria, N. Fläschner, C. Becker, K. Sengstock, and C. Weitenberg, Nat. Phys. 15, 917 (2019).
[11] A. Bohrdt, C. S. Chiu, G. Ji, M. Xu, D. Greif, M. Greiner, E. Demler, F. Grusdt, and M. Knap, Nat. Phys. 15, 921 (2019).
[12] V. Dunjko and H. J. Briegel, Rep. Prog. Phys. 81, 074001 (2018).
[13] T. Ohtsuki and T. Ohtsuki, J. Phys. Soc. Jpn. 86, 044708 (2017).
[14] J. Carrasquilla, Adv. Phys.: X 5, 1797528 (2020).
[15] Y.-H. Liu and E. P. L. van Nieuwenburg, Phys. Rev. Lett. 120, 176401 (2018).
[16] J. F. Rodriguez-Nieva and M. S. Scheurer, Nat. Phys. 15, 790 (2019).
[17] P. Huembeli, A. Dauphin, and P. Wittek, Phys. Rev. B 97, 134109 (2018).
[18] F. Schäfer and N. Lörch, Phys. Rev. E 99, 062107 (2019).
[19] E. Greplova, A. Valenti, G. Boschung, F. Schäfer, N. Lörch, and S. D. Huber, New J. Phys. 22, 045003 (2020).
[20] A. Dawid, P. Huembeli, M. Tomza, M. Lewenstein, and A. Dauphin, arXiv:2004.04711 (2020).
[21] C. Casert, T. Vieijra, J. Nys, and J. Ryckebusch, Phys. Rev. E 99, 023304 (2019).
[22] S. J. Wetzel, Phys. Rev. E 96, 022140 (2017).
[23] Y. Che, C. Gneiting, T. Liu, and F. Nori, arXiv:2002.02363 (2020).
[24] M. S. Scheurer and R.-J. Slager, Phys. Rev. Lett. 124, 226401 (2020).
[25] S. Blücher, L. Kades, J. M. Pawlowski, N. Strodthoff, and J. M. Urban, Phys. Rev. D 101, 094507 (2020).
[26] Y. Zhang, P. Ginsparg, and E.-A. Kim, Phys. Rev. Research 2, 023283 (2020).
[27] K. Liu, J. Greitemann, and L. Pollet, Phys. Rev. B 99, 104410 (2019).
[28] O. Balabanov and M. Granath, Phys. Rev. Research 2, 013354 (2020).
[29] Y. Long, J. Ren, and H. Chen, Phys. Rev. Lett. 124, 185501 (2020).
[30] J. Singh, V. Arora, V. Gupta, and M. S. Scheurer, arXiv:2006.11868 (2020).
[31] Y. Bengio and O. Delalleau, in Algorithmic Learning Theory, edited by J. Kivinen, C. Szepesvári, E. Ukkonen, and T. Zeugmann (Springer Berlin Heidelberg, Berlin, Heidelberg, 2011) pp. 18–36.
[32] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[33] C. Molnar, Interpretable Machine Learning (2019), https://christophm.github.io/interpretable-ml-book/.
[34] A. Cole, G. J. Loges, and G. Shiu, arXiv:2009.14231 (2020).
[35] L. M. Falicov and J. C. Kimball, Phys. Rev. Lett. 22, 997 (1969).
[36] J. Hubbard, Proc. R. Soc. Lond. A 276, 238 (1963).
[37] J. K. Freericks and V. Zlatić, Rev. Mod. Phys. 75, 1333 (2003).
[38] R. Lemański, J. K. Freericks, and G. Banach, Phys. Rev. Lett. 89, 196403 (2002).
[39] R. Lemański, J. K. Freericks, and G. Banach, J. Stat. Phys. 116, 699 (2004).
[40] H. Čenčariková and P. Farkašovský, Condens. Matter Phys. 14, 42701 (2011).
[41] M. Plischke, Phys. Rev. Lett. 28, 361 (1972).
[42] P. Farkašovský, Phys. Rev. B 51, 1507 (1995).
[43] P. Farkašovský, Phys. Rev. B 52, R5463 (1995).
[44] P. Haldar, M. S. Laad, and S. R. Hassan, Phys. Rev. B 99, 125147 (2019).
[45] A. Kauch, P. Pudleiner, K. Astleithner, P. Thunström, T. Ribic, and K. Held, Phys. Rev. Lett. 124, 047401 (2020).
[46] M. M. Maśka, R. Lemański, J. K. Freericks, and C. J. Williams, Phys. Rev. Lett. 101, 060404 (2008).
[47] M. M. Maśka, R. Lemański, C. J. Williams, and J. K. Freericks, Phys. Rev. A 83, 063631 (2011).
[48] A. Hu, M. M. Maśka, C. W. Clark, and J. K. Freericks, Phys. Rev. A 91, 063624 (2015).
[49] T. Qin, A. Schnell, K. Sengstock, C. Weitenberg, A. Eckardt, and W. Hofstetter, Phys. Rev. A 98, 033601 (2018).
[50] D. O. Maionchi, A. M. C. Souza, H. J. Herrmann, and R. N. da Costa Filho, Phys. Rev. B 77, 245126 (2008).
[51] A. E. Antipov, Y. Javanmard, P. Ribeiro, and S. Kirchner, Phys. Rev. Lett. 117, 146601 (2016).
[52] P. Haldar, M. S. Laad, and S. R. Hassan, Phys. Rev. B 95, 125116 (2017).
[53] A. Smith, J. Knolle, D. L. Kovrizhin, and R. Moessner, Phys. Rev. Lett. 118, 266601 (2017).
[54] M. Žonda, J. Okamoto, and M. Thoss, Phys. Rev. B 100, 075124 (2019).
[55] T. Ribic, G. Rohringer, and K. Held, Phys. Rev. B 93, 195105 (2016).
[56] T. Ribic, G. Rohringer, and K. Held, Phys. Rev. B 95, 155130 (2017).
[57] J. K. Freericks, V. M. Turkowski, and V. Zlatić, Phys. Rev. Lett. 97, 266408 (2006).
[58] M. Eckstein and M. Kollar, Phys. Rev. Lett. 100, 120404 (2008).
[59] M. Eckstein, M. Kollar, and P. Werner, Phys. Rev. Lett. 103, 056403 (2009).
[60] A. J. Herrmann, N. Tsuji, M. Eckstein, and P. Werner, Phys. Rev. B 94, 245114 (2016).
[61] A. J. Herrmann, A. E. Antipov, and P. Werner, Phys. Rev. B 97, 165107 (2018).
[62] J. Freericks, Transport in Multilayered Nanostructures: The Dynamical Mean-Field Theory Approach (Imperial College Press, London, 2006) pp. 1–328.
[63] M. Žonda and M. Thoss, Phys. Rev. B 99, 155157 (2019).
[64] R. Smorka, M. Žonda, and M. Thoss, Phys. Rev. B 101, 155116 (2020).
[65] D. Schattschneider, Am. Math. Monthly 85, 439 (1978).
[66] See Supplemental Material for details on the simulated annealing procedure, the prediction-based, and the mean-based method, as well as a discussion of alternative phase classification methods and definitions of order parameters and correlation functions, which includes Refs. [9, 21, 22, 25, 26, 32, 38, 39, 70–81].
[67] M. D. Petrović, B. S. Popescu, U. Bajpai, P. Plecháč, and B. K. Nikolić, Phys. Rev. Applied 10, 054038 (2018).
[68] X.-H. Li, Z. Chen, and T. K. Ng, Phys. Rev. B 100, 094519 (2019).
[69] M. Gonçalves, P. Ribeiro, R. Mondaini, and E. V. Castro, Phys. Rev. Lett. 122, 126601 (2019).
[70] J. Arnold, F. Schäfer, M. Žonda, and A. U. J. Lode, Interpretable and unsupervised phase classification (2020), https://github.com/arnoldjulian/Interpretable-and-unsupervised-phase-classification.
[71] M. M. Maśka and K. Czajka, Phys. Rev. B 74, 035109 (2006).
[72] M.-T. Tran, Phys. Rev. B 73, 205110 (2006).
[73] M. Žonda, Phase Transitions 85, 96 (2012).
[74] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, in Neural Networks: Tricks of the Trade (Springer, 2012) pp. 9–48.
[75] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, in Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019) pp. 8026–8037.
[76] D. Kingma and J. Ba, arXiv:1412.6980 (2014).
[77] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, J. Mach. Learn. Res. 12, 2825 (2011).
[78] D. L. Kreher and D. R. Stinson, SIGACT News 30, 33 (1999).
[79] P.-L. Chau and A. Hardwick, Mol. Phys. 93, 511 (1998).
[80] P. Mehta, M. Bukov, C.-H. Wang, A. G. R. Day, C. Richardson, C. K. Fisher, and D. J. Schwab, Phys. Rep. 810, 1 (2019).
[81] A. W. van der Vaart, Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, 1998).
Interpretable and unsupervised phase classification: Supplemental Material
Julian Arnold, Frank Schäfer, Martin Žonda, and Axel U. J. Lode
Department of Physics, University of Basel, Klingelbergstrasse 82, 4056 Basel, Switzerland
Institute of Physics, Albert-Ludwigs-Universität Freiburg, Hermann-Herder-Strasse 3, 79104 Freiburg im Breisgau, Germany
(Dated: October 13, 2020)

This Supplemental Material contains technical details necessary for a reproduction of the results presented in the main text and additional arguments to support our conclusions. We start by addressing details of the simulated annealing procedure to generate ground-state configurations of the Falicov-Kimball model (FKM). We provide technical details on the training and architecture of neural networks in the prediction-based method, as well as a derivation of its optimal predictive model in both the noise-free and noisy case. Next, order parameters and correlation functions quantifying the presence of different orderings and correlations in the FKM are defined. To contextualize the prediction-based method and motivate the mean-based method, we compare the methods to two alternative phase classification schemes. Next, technical details on the calculation of the indicator in the mean-based method are given and potential variations and extensions are discussed. Finally, we provide complementary figures (Figs. S5–S7) which support the discussion from the main text: the vector field arising in the prediction-based method, the inferred two-dimensional phase diagram using the mean-based method with |F| as input, as well as an analysis of the line-scan at ρ = 63/400 ≈ 0.16 for different types of inputs in the noise-free and noisy case. The code for the prediction- and mean-based method that was utilized in this work is open source [70].
CONTENTS
S1. Simulated annealing
S2. Prediction-based method
S2.1. Deep neural networks
S2.2. Linear models
S2.3. Derivation of optimal predictive model
S3. Order parameters and correlation functions
S3.1. Order parameters
S3.2. Correlation functions
S4. Alternative phase classification methods
S5. Mean-based method
S6. Complementary figures
S1. SIMULATED ANNEALING
For a fixed f-particle configuration w, the Hamiltonian of the FKM in Eq. (1) in the main text can be transformed into

H_w = Σ_{j,j'} h_{jj'} d†_j d_{j'} = Σ_α λ^w_α b†_α b_α,  (S1)

where we introduce the matrix elements h_{jj'} = U w_j δ_{jj'} − t δ_{⟨jj'⟩}. Its eigenvalues λ^w_α are obtained by numerical diagonalization. Finding the ground state of the FKM then means finding the configuration w which leads to the lowest energy

E_gs(w) = Σ_{α=1}^{N_d} λ^w_α.  (S2)

However, even after accounting for the lattice symmetries, the ground-state configurations of systems with linear size L = 20 cannot, in general, be determined exactly by comparing the energies of all possible configurations w. An approximate method is required. Instead of using a reduced set of chosen orderings, as was done in previous studies [38, 39] of the model, we determine the corresponding f-particle ground-state configuration w using simulated annealing.

We use an algorithm based on a semi-classical Metropolis Monte Carlo scheme [71], where we use the ground-state energy E_gs(w) in the statistical weights instead of the free energy. This means that the candidate configuration w_c, generated by a random displacement of a single f particle from the current configuration w, is accepted as the new w if E_gs(w_c) ≤ E_gs(w) or if min(1, exp[−β(E_gs(w_c) − E_gs(w))]) > r, where r is a random number drawn from a uniform distribution on [0, 1) and β = 1/T is the inverse temperature (T). We first used a classical protocol, where we started at a relatively high temperature and cooled the sample to zero in 20−40 discrete temperature steps. A thermalization process consisting of a large number of Monte Carlo updates (scaling with the number of lattice sites) was performed at every temperature step.

However, we have found that an alternative adaptive protocol was much more efficient in lowering the energy. Namely, we start the annealing with a long thermalization at a low temperature. In the next steps, depending on whether the algorithm has found a configuration with lower energy at the current temperature or not, the temperature was either lowered by dividing its value by a factor between one and two (typically 1.25) or increased by multiplying it by the same factor. The modified protocol is better at escaping local minima and has less trouble with the fact that the FKM can go through more than one ordered phase with decreasing temperature [72, 73].

We have typically used a number of independent runs with random initial conditions. For small lattices, a few such runs sufficed to recover ground states that are unique up to transformations of p4m. For L = 20 we used 64 runs with random initial conditions, plus several runs with initial configurations reflecting typical ground-state orderings identified for smaller lattices. For each p = (U, ρ) we took the configuration w with the lowest energy and compared it with the energy calculated using the configuration w obtained as the ground state for the same ρ, but different (neighboring) U. The configuration w with the lowest energy was then taken as the final ground-state approximation.
S2. PREDICTION-BASED METHOD

In this section, we provide details on the architecture and training of the deep neural networks (DNNs) and linear models used in the prediction-based method. In particular, we list the corresponding hyperparameters employed throughout this work.
S2.1. Deep neural networks
In this work, we analyzed the ground-state phase diagram of the spinless FKM using the prediction-based method with DNNs as predictive models. These are built as follows: if the NN input is image-like, such as ground-state configurations w or the magnitude of their discrete Fourier transform |F|, we first apply K different square filters with the same linear size L as the input image [32]. Subsequently, we apply a rectified linear unit, ReLU(x) = max(0, x), as an activation function [32]. This results in an output feature map of size 1 × 1 × K which is then flattened to a feature vector with K elements. In the case of vector-like NN inputs, we skip this step. In both cases, we feed the corresponding vectors into a series of fully-connected layers (FCLs), where ReLUs are used as activation functions [32]. We note that the NN architecture remains to be optimized systematically. However, such a DNN with sufficiently many parameters can serve as an accurate predictive model.

For training the DNNs, the inputs are standardized by means of the affine transformation

x'_i = (x_i − x̄_i)/σ_i,  (S3)

and the outputs are normalized as

x'_i = x_i/σ_i,  (S4)

where x_i denotes the i-th input/output, and x̄_i and σ_i are the mean and standard deviation of the distribution of the i-th input/output over the entire training data. Standardization ensures that the distribution of each transformed input x'_i over the entire training data is characterized by (x̄'_i = 0, σ'_i = 1), whereas normalization results in the distribution of each transformed output x'_i over the entire training data being characterized by (x̄'_i = x̄_i/σ_i, σ'_i = 1). Scaling of the inputs, here by means of standardization, is common practice in the data pre-processing step of machine learning tasks relying on gradient descent for optimization, as it generally yields a faster convergence rate [74]. The additional normalization of the outputs improves the model accuracy when training with a mean-square error (MSE) loss function, as it ensures that the outputs do not differ in size or spread and are consequently treated on an equal footing during the optimization. The MSE loss function is defined as

L_MSE = (1/(N_p N_x)) Σ_p Σ_x ‖p − p̂(x)‖²,  (S5)

where the sum runs over all N_p sampled points p in parameter space and all N_x inputs x at each point p. Here, p̂ = (Û, ρ̂) denotes the predictions of the DNN given a particular input x.

The DNNs are implemented in PyTorch [75], where the weights and biases are optimized using the stochastic gradient-based optimizer Adam [76] to minimize the loss function L_MSE [Eq. (S5)] over a series of epochs. After each training epoch, the vector-field divergence

∇_p · δp = ∂δU/∂U|_p + ∂δρ/∂ρ|_p  (S6)

is calculated based on the predictions p̂ for each sampled point p in parameter space. This is done using the symmetric difference quotients

∂δU/∂U|_p ≈ [δU(U + ∆U, ρ) − δU(U − ∆U, ρ)]/(2∆U),  (S7)
∂δρ/∂ρ|_p ≈ [δρ(U, ρ + ∆ρ) − δρ(U, ρ − ∆ρ)]/(2∆ρ),

where δU = Û − U and δρ = ρ̂ − ρ. The divergence is averaged over all inputs per point p. The learning rate is reduced by a fixed factor f_r if the loss L_MSE does not drop below a certain relative threshold value within a given number of epochs, referred to as "patience". Gradients are calculated using backpropagation.
During training, weights and biases are updated batch-wise, i.e., during each epoch the entire training data is randomly split into batches of equal size. For each batch, the predictions and the resulting loss L_MSE are calculated and the NN parameters are then updated accordingly.

To incorporate configurations related through transformations of p4m we use online data augmentation, i.e., each time a configuration is revisited during training a random transformation of p4m is performed. Note that if we use |F| as inputs, we do not need to consider any lattice translations. In the case of symmetry-invariant inputs, such as correlation functions, we do not need to apply any transformations beforehand. Data augmentation is crucial, as all configurations related through transformations of p4m have the same energy. Therefore, data augmentation aims at removing physically irrelevant differences between configuration samples and forces the NN to pick up on patterns in the input which are equally present in all the transformed versions. The DNN hyperparameters employed in this work are collected in Tab. S1. The color scale in Fig. 2(b) and (c) in the main text was cut off at −1 and −2, respectively, for better visualization. There were very few distinct points in parameter space with a divergence signal ∇_p · δp below this cut-off.

TABLE S1. Hyperparameters of the DNNs employed in this work, with one column per figure [Figs. 2(b), 2(c), 3(a), S7(a), S7(g)], comprising the number of filters K, the total number of NN parameters (weights and biases) N_tot, the learning-rate reduction factor f_r, and the layer sizes; the number of inputs n_in and outputs n_out of each fully-connected layer (FCL) is denoted as (n_in, n_out). Default settings are used except where explicitly stated.
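The pre-processing of Eqs. (S3) and (S4) and the loss of Eq. (S5) translate directly into PyTorch; a minimal sketch (ours; shapes and the source of the statistics are illustrative):

```python
import torch

def standardize_inputs(x, mean, std):
    return (x - mean) / std   # Eq. (S3): zero mean, unit spread over training data

def normalize_outputs(p, std):
    return p / std            # Eq. (S4): unit spread only

def mse_loss(p_true, p_hat):
    # Eq. (S5): mean over sampled points and inputs of ||p - p_hat||^2
    return ((p_true - p_hat) ** 2).sum(dim=-1).mean()
```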
S2.2. Linear models

To construct directly interpretable local order parameters for each predicted phase transition, we rely on the prediction-based method with linear models. The predictions p̂ of a linear model are given as

p̂ = θx + b,  (S8)

with an input vector x, a weight matrix θ, and a vector of additive biases b. Evidently, a linear model allows for a direct interpretation in terms of its weights θ and biases b. In addition to the MSE term in Eq. (S5), when training linear models we add an L2 regularization term to our loss function,

L = L_MSE + λ Σ_i θ_i².  (S9)

Here, the regularization rate λ controls the strength of our preference for smaller weights θ_i, where the sum runs over all weights [32]. In this work, λ is chosen small enough such that the trained model qualitatively yields the same predictions as a model trained to minimize L_MSE. The additional restrictions posed on the model by the L2 regularization term thereby remove the remaining freedom in its parameters. This is of particular importance when trying to interpret the model through its weights and biases [21, 25, 26].

The linear models are trained to minimize L [Eq. (S9)] using the scikit-learn implementation of Ridge regression with default settings [77]. Data augmentation is performed offline by applying n_trafo random transformations of p4m to each input configuration beforehand, analogous to the online variant described previously. The hyperparameters for training the linear models employed in this work are collected in Tab. S2.

TABLE S2. Hyperparameters for training the linear models employed in this work, with one column per figure [Figs. 3(b), S7(b), S7(h)], comprising the number of random symmetry transformations n_trafo and the regularization rate λ. Default settings were used except where explicitly stated.
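The p4m data augmentation used for both the DNNs and the linear models can be realized with standard array operations; a sketch (our illustration) that draws one random symmetry-transformed copy of a configuration w on the periodic L × L lattice:

```python
import numpy as np

def random_p4m_transform(w, rng):
    """Apply a random lattice symmetry to a 2D configuration w:
    a random periodic translation, one of four rotations,
    and an optional mirror reflection."""
    out = np.roll(w, shift=rng.integers(w.shape[0]), axis=0)
    out = np.roll(out, shift=rng.integers(w.shape[1]), axis=1)
    out = np.rot90(out, k=rng.integers(4))   # 0, 90, 180, or 270 degrees
    if rng.random() < 0.5:
        out = np.fliplr(out)                 # reflection
    return out
```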
S2.3. Derivation of optimal predictive model
In this section, we provide an analytical derivation of the optimal model predictions when using the prediction-based method for phase classification in the general noisy case, as well as in the special noise-free case.
Noisy case – We assume to have a system with a single, tunable parameter p, which we sample on an equidistant grid with a grid spacing ∆p. At each grid point p_i, we draw inputs x from an underlying probability distribution, x ∼ P_i(x). This situation is illustrated in Fig. S1(a). We train a model f: x → f(x) to infer p from the samples {x} generated at p, i.e., to minimize an MSE loss function

L_MSE = (1/(N_p N_x)) Σ_p Σ_x (p − f(x))².  (S10)

Here, p runs over all N_p sampled grid points and x runs over all N_x inputs at p.

FIG. S1. Schematic representation of a 1D space spanned by the parameter p which is sampled equidistantly with a spacing ∆p. (a) At each point p_i, samples x are drawn from a given underlying probability distribution x ∼ P_i(x). (b) Two distinct regions, labelled I and II, in the 1D space spanned by the parameter p. For all points p ∈ {p_1, p_2, ..., p_n} in region I samples are drawn from the same set x ∈ {x^I_0, x^I_1, ..., x^I_α}. Similarly, for all points p ∈ {p_{n+1}, p_{n+2}, ..., p_m} in region II samples are drawn from x ∈ {x^II_0, x^II_1, ..., x^II_β}. The points p_n and p_{n+1} (red) make up the boundary of the two regions, and the center of mass of region I/II is denoted as ⟨p⟩_I/II (blue), respectively.

Let us pick a particular input x_j. We can determine the optimal model prediction f_opt(x_j) by minimizing L_MSE w.r.t. f(x_j), i.e.,

∂L_MSE/∂f(x_j) = −(2/(N_p N_x)) Σ_p N^j_x(p) (p − f_opt(x_j)) = 0.  (S11)

Here, N^j_x(p_i) denotes the number of times the particular input x_j is present at point p_i, and P_i(x_j) ≡ N^j_x(p_i)/N_x is the associated probability. We can additionally define P̃_i(x_j) ≡ P_i(x_j)/Σ_p P_p(x_j) as the probability of drawing the particular input x_j at point p_i compared to all other sampled points p. We then obtain

f_opt(x_j) = Σ_p P̃_p(x_j) p.  (S12)

Repeating this step for all inputs {x}, we find that any model f_opt which minimizes L_MSE will output f_opt(x_j) for an input x_j. This implies that the optimal model predicts the center of mass for a particular input x_j, where each grid point is weighted according to the probability to draw the input x_j. Note that there are no additional restrictions on the form of f.

Given an optimal model f_opt, the divergence of δp at a point p_i is calculated as

∂δp/∂p|_{p_i} ≈ [δp(p_i + ∆p) − δp(p_i − ∆p)]/(2∆p).  (S13)

Here, δp = p̂ − p with p̂ = (1/N_x) Σ_x f_opt(x), where the sum runs over all N_x samples {x} drawn at point p. Hence,

∂δp/∂p|_{p_i} ≈ [p̂(p_i + ∆p) − p̂(p_i − ∆p)]/(2∆p) − 1.  (S14)

The generalization to a parameter space of arbitrary dimension is straightforward.

Here, we have derived the divergence signal of an optimal model f_opt for the most general (noisy) situation. This removes the need for further interpretation of the DNN, because it merely serves to approximate f_opt, and thereby renders the method interpretable. Additionally, it opens up the possibility to approximate f_opt without the need for DNNs as universal function approximators. Specifically, one may compute the optimal model predictions f_opt in Eq. (S12) from estimates of the underlying probability distributions P(x), e.g., obtained using Monte Carlo methods. A comparison of such approaches to NN-based ones will be the subject of future studies.

Note that even with the form of the optimal model f_opt at hand, we may still want to investigate the decision-making of DNNs trying to approximate f_opt to obtain physical insights, e.g., using state-of-the-art attribution methods following Refs. [21, 22]. However, the scheme based on training linear models for predicted phase transitions to construct local order parameters, proposed in the main text, proved to be more effective for automating this task.

Noise-free case – Consider now the special noise-free case in which there are regions along p where the samples {x} are identical up to transformations of p4m. A situation with two such regions, labelled I and II, is shown in Fig. S1(b). In particular, there is a single set x ∈ {x^I_0, x^I_1, ..., x^I_α} from which the samples are drawn at each point p ∈ {p_1, p_2, ..., p_n} within region I. Similarly, samples at all points p ∈ {p_{n+1}, p_{n+2}, ..., p_m} within region II are drawn from a set x ∈ {x^II_0, x^II_1, ..., x^II_β} not related by transformations of p4m to the set of region I. Consequently, P̃(x^I_j) = 0 for all points p in region II and P̃(x^I_j) = const. = 1/N^I_p for all points p in region I, and vice-versa. Here, N^I_p denotes the number of sampled points in region I. Using Eq. (S12) we then obtain

f_opt(x^I_i) = (1/N^I_p) Σ_{p ∈ I} p = ⟨p⟩_I,  (S15)

where p runs over all N^I_p sampled points in region I. Similarly, f_opt(x^II_i) = ⟨p⟩_II. Meaning, the optimal model yields predictions at the center of mass for all samples within a given region.

Given an optimal model f_opt, the divergence of δp at a point p_i is given by Eq. (S14). Hence, for any point p_i within region j ∈ {I, II} that is not directly located in the vicinity of the boundary of the region, we have

∂δp/∂p|_{p_i} ≈ −1.  (S16)

Conversely, for the two points at the boundary between regions I and II we have

∂δp/∂p|_{p_i} ≈ (⟨p⟩_II − ⟨p⟩_I)/(2∆p) − 1.  (S17)

This shows that the prediction-based method predicts a phase transition whenever neighboring inputs cannot be related through transformations of p4m. Equivalently, it predicts phases to be regions in which neighboring inputs are simply related by transformations of p4m. The value of the divergence peak at a phase boundary serves as an indicator of the mean extent of the corresponding phases in parameter space. The size of a phase is thus linked to its stability, i.e., its robustness against variations in the system parameters. This procedure can be generalized straightforwardly to a parameter space of arbitrary dimension with an arbitrary number of phases.

Figure S2 shows the vector-field divergence ∇_p · δp as a function of U at ρ = 35/400 for the DNN trained using |F| as input in the noise-free case, whose predicted two-dimensional ground-state phase diagram of the FKM is shown in Fig. 2(b) in the main text. The values of the divergence match the results in Eqs. (S16) and (S17) almost perfectly. This confirms that our trained predictive model is indeed optimal, i.e., minimizes L_MSE.

FIG. S2. Vector-field divergence ∇_p · δp as a function of U at ρ = 35/400 obtained analytically based on Eqs. (S16) and (S17), as well as numerically using a DNN trained with |F| as input for the two-dimensional ground-state phase diagram in the "noise-free" case [see Fig. 2(b) in the main text for the full predicted phase diagram].
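Equations (S12), (S16), and (S17) can be checked numerically in a few lines. The following sketch (our toy example, with two noise-free regions on a 1D grid) reproduces the −1 plateau and the boundary peak:

```python
import numpy as np

p_grid = np.arange(1.0, 8.1, 0.1)     # equidistant grid, spacing dp = 0.1
region = p_grid < 4.0                 # two "phases": I (p < 4) and II (p >= 4)
# noise-free optimal predictions, Eq. (S15): center of mass of each region
p_hat = np.where(region, p_grid[region].mean(), p_grid[~region].mean())
delta_p = p_hat - p_grid
div = np.gradient(delta_p, p_grid)    # symmetric differences, Eq. (S13)
# div is ~ -1 inside each region [Eq. (S16)] and peaks at the boundary [Eq. (S17)]
```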
S3. ORDER PARAMETERS AND CORRELATION FUNCTIONS

In this section, we discuss order parameters for the segregated, diagonal, and axial orderings [cf. labels (1)-(3) in Fig. 2 in the main text], and provide definitions for the set of three correlation functions κ^ξ_n measuring square, axial, and diagonal order.

S3.1. Order parameters
To define order parameters for the diagonal (di), as well as axial (ax) orderings, we introduce appropriate filters F^ξ, ξ ∈ {di, ax}. The values of the order parameters are obtained by taking the Frobenius scalar product of the raw configurations w with the corresponding filters. To account for configurations that are related through transformations of p4m, we also subject the filters to the corresponding transformations. Ultimately, we take the maximum value over all symmetry-related filters {F^ξ} as the value of the order parameter O^ξ for a particular configuration sample w:

O^ξ(w) ≡ max_{F^ξ} (1/L²) Σ_{i=1}^{L} Σ_{j=1}^{L} (F^ξ ⊙ (2w − 𝟙))_{ij},  (S18)

where ⊙ denotes the element-wise product and 𝟙 is the matrix with all entries equal to one. Figure S3 displays representative filters for the order parameters of diagonal and axial orderings, where all other filters can be obtained from these examples through transformations of p4m. Note that the filters have the same size as the configurations they are applied to, here L = 20. The filters for lattices of different size can be defined analogously by retaining the same patterns as in Fig. S3. If necessary, order parameters for other orderings [cf. labels (4)-(9) in Fig. 2 in the main text] can be defined in a similar manner.

FIG. S3. Representative filter F^ξ for computing the order parameter for diagonal (F^di) or axial (F^ax) orderings. Here, black denotes 1 and white denotes −1.

Defining an order parameter for the segregated (sg) ordering is conceptually simple. It amounts to determining whether the configuration sample contains a single, connected cluster of f particles. This is implemented by a backtracking algorithm [78]. We define a binary order parameter O^sg taking on the value 1 (0) if the configuration sample does (does not) contain a single, connected cluster of f particles (as determined by the algorithm).

Clearly, all three order parameters O^ξ, ξ ∈ {di, ax, sg}, share a common set of desired properties [79]. In particular, the order parameters are invariant when the input configuration samples are subjected to transformations of p4m. Furthermore, the maximum value of the order parameters is max_w O^ξ(w) = 1, which is only achieved for samples showing perfect ordering of type ξ. Additionally, the order parameters defined by means of filters can take on values ranging from 0 to 1, indicating the partial presence of the corresponding pattern.

Figure S4 shows the values of all three order parameters O^ξ, ξ ∈ {di, ax, sg}, for each sampled point p = (U, ρ) in parameter space for the FKM in the noise-free and noisy case. In the noisy case, we additionally average over all available configurations at each point p to obtain a scalar value. The order parameters reveal the presence of segregated, diagonal, and axial orderings marked as (1), (2), and (3) in Fig. 2 in the main text, respectively.

FIG. S4. Values of order parameters (a),(d) O^sg, (b),(e) O^ax, and (c),(f) O^di in the (a)-(c) noise-free and (d)-(f) noisy case.
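A sketch of Eq. (S18) (our illustration; the filter is passed in, and the set of symmetry-related copies is generated on the fly by translations, rotations, and reflections):

```python
import numpy as np

def order_parameter(w, base_filter):
    """Eq. (S18): maximum Frobenius overlap of 2w - 1 with all
    symmetry-related copies of a +/-1 filter of the same size as w."""
    L = w.shape[0]
    s = 2 * w - 1                            # map {0, 1} -> {-1, +1}
    best = -np.inf
    for dx in range(L):                      # lattice translations
        for dy in range(L):
            f = np.roll(np.roll(base_filter, dx, 0), dy, 1)
            for k in range(4):               # rotations by multiples of 90 degrees
                for g in (f, np.fliplr(f)):  # with and without reflection
                    best = max(best, np.sum(np.rot90(g, k) * s) / L**2)
    return best
```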
S3.2. Correlation functions

As a set of observables derived from the configurations which remain invariant under transformations of p4m, we here consider a minimal set of three different correlation functions κ^ξ_n, ξ ∈ {sq, ax, di}. In particular, these measure square (κ^sq_n), axial (κ^ax_n), and diagonal (κ^di_n) correlations over n lattice sites. For a configuration sample w, the three correlation functions are calculated as

κ^sq_n = (1/(8nL²)) Σ_{i=1}^{L} Σ_{j=1}^{L} Σ_{α=−n}^{n} Σ_{β=−n}^{n} (δ_{|α|,n} + δ_{|β|,n} − δ_{|α|+|β|,2n}) (2w_{i,j} − 1)(2w_{i+α,j+β} − 1),  (S19)

κ^ax_n = (1/(4L²)) Σ_{i=1}^{L} Σ_{j=1}^{L} Σ_{α∈{−n,0,n}} Σ_{β∈{−n,0,n}} δ_{|α|+|β|,n} (2w_{i,j} − 1)(2w_{i+α,j+β} − 1),

κ^di_n = (1/(4L²)) Σ_{i=1}^{L} Σ_{j=1}^{L} Σ_{α∈{−n,n}} Σ_{β∈{−n,n}} δ_{|α|+|β|,2n} (2w_{i,j} − 1)(2w_{i+α,j+β} − 1).

Since we assume periodic boundary conditions, the largest unique n is given by n = L/2. The computation of the correlation functions from Eq. (S19) is illustrated in Fig. 2(d) of the main text. In the case of the FKM on a square lattice, we find that these three correlation functions and combinations thereof are sufficient to describe most patterns and orderings. Therefore, they represent a physically-motivated way of detecting phase transitions based on the magnitude of the change in order in the set of correlation functions.
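For concreteness, the axial and diagonal correlators in Eq. (S19) can be evaluated with periodic shifts of the configuration; a sketch (our illustration):

```python
import numpy as np

def kappa_ax(w, n):
    """Axial correlator over n sites, Eq. (S19), on a periodic lattice."""
    s = 2 * w - 1                     # map {0, 1} -> {-1, +1}
    total = sum(np.sum(s * np.roll(s, shift, axis=ax))
                for ax in (0, 1) for shift in (n, -n))
    return total / (4 * w.size)       # 4 axial neighbors at distance n

def kappa_di(w, n):
    """Diagonal correlator over n sites, Eq. (S19)."""
    s = 2 * w - 1
    total = sum(np.sum(s * np.roll(np.roll(s, a, 0), b, 1))
                for a in (n, -n) for b in (n, -n))
    return total / (4 * w.size)       # 4 diagonal neighbors at distance n
```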
S4. ALTERNATIVE PHASE CLASSIFICATION METHODS
The derivation of the optimal model predictions has strengthened our understanding of the prediction-based method. While we have so far relied on DNNs to approximate f_opt, it opens up the possibility of devising algorithms which yield the same predictions as f_opt, or equivalently, reproduce the predicted phase diagram. To reproduce the predicted ground-state phase diagram in the noise-free case obtained using the prediction-based method [see Fig. 2(b) in the main text], a model simply needs to be sensitive to any change in the configurations (up to transformations of p4m). If a change is detected, a new phase is declared. Once all the points in parameter space are analysed, the full phase diagram is obtained. The main approach we take in this work is to adopt a representation of the configurations which is invariant under transformations of p4m: the correlation functions in Eq. (S19). Consequently, a simple comparison of the representations reveals the phase diagram. Furthermore, the extension of this approach to the noisy case is straightforward and eventually leads to the formulation of the general mean-based method. In what follows, we describe two alternative approaches to reproduce the results of the prediction-based method in the special noise-free case.

As a naïve first approach, one could compare the ground-state configuration samples of different points in parameter space. If a symmetry transformation is found which relates the configuration samples, the corresponding points belong to the same phase; otherwise a new phase is declared. Clearly, the computational complexity of such an approach is reduced significantly by considering |F|, as opposed to w, to remove the need to consider lattice translations. Note that such an approach fails when considering the general noisy case.

As a second approach, we propose to use the system Hamiltonian, as motivated by the simulated annealing procedure. For a given point in parameter space (point I), we take the corresponding ground-state configuration sample and calculate its energy using the system Hamiltonian at a neighbouring point along U (point II). Additionally, we evaluate the energy of the ground-state configuration sample at point II using the system Hamiltonian at point II. If the difference in energy is smaller than an appropriate threshold value, we may regard the two samples as degenerate and assign them to the same phase. This is valid since both samples are equally likely to be generated using the simulated annealing procedure. Otherwise a new phase is declared. Analysing the entire two-dimensional parameter space in this fashion will eventually yield the same phases as predicted by the prediction-based method. Note that such an approach assumes that we have perfect knowledge of the system Hamiltonian, which is usually not the case in experiments. Furthermore, an extension of this approach to the general noisy case may not be straightforward.
S5. MEAN-BASED METHOD

As a key result, we propose the mean-based method as a novel data-driven scheme for identifying and characterizing phase transitions in an automated fashion. It relies on a difference signal ∆x̄ that serves as a generic indicator for phase transitions, which is calculated as

∆x̄(p) = ‖x̄(U + ∆U, ρ) − x̄(U − ∆U, ρ)‖.  (S20)

Here, x̄(p_i) = Σ_j P_i(x_j) x_j ≈ (1/N_x) Σ_j x_j(p_i) denotes the average input at a point p_i, where we average over all corresponding N_x inputs x_j.

Given a set of configurations {x} at a point p_i, we perform offline data augmentation (see the training procedure for DNNs and linear models in Sec. S2) by applying n_trafo random transformations of p4m to each configuration. Based on the augmented set of configurations, we then compute input features x (such as κ or |F|) or use the configurations themselves to obtain the corresponding average x̄. In this work, we choose n_trafo = 20 for the raw configurations w, with reduced values for |F| and the corresponding correlation functions κ, respectively. The difference in n_trafo when using different inputs is based on the fact that a reduced number of transformations needs to be considered when the chosen input is invariant under (parts of) the transformations of p4m, as is the case for |F| or κ. Clearly, this results in a reduced computational cost compared to using an input which is (in general) not invariant under transformations of p4m, such as the raw configurations w themselves.

Note that the generic indicator in Eq. (S20) can easily be extended to include changes in ρ. This can, for example, be accomplished by defining the generic indicator as

∆x̄(p) = ‖x̄(U + ∆U, ρ) − x̄(U − ∆U, ρ)‖ + ‖x̄(U, ρ + ∆ρ) − x̄(U, ρ − ∆ρ)‖.  (S21)

Similarly, we may extend the indicator to parameter spaces of arbitrary dimension.

Clearly, the mean-based method relying on Eq. (S20) as an indicator fails at identifying phase transitions for which x̄ remains unchanged. As different inputs will be more suitable depending on the nature of the phase transitions under investigation, an appropriate choice of input can likely combat such failures. However, we aim to provide a method which does not require significant tuning of the input, e.g., based on incorporating physical knowledge of the system at hand. For this, one may extend the approach to detect changes in the m-th order moments µ_{x,m} of the underlying probability distributions P(x), as opposed to the mean. The corresponding indicators are then given as

∆µ_{x,m}(p) = ‖µ_{x,m}(U + ∆U, ρ) − µ_{x,m}(U − ∆U, ρ)‖.  (S22)

Equation (S22) represents variants of the mean-based method relying on different measures that characterize changes in the underlying probability distributions P(x). In particular, such indicators may yield complementary information about the corresponding phase transitions.

We continue the thought of devising measures which quantify changes in the distributions P(x), and can thereby serve as indicators for phase transitions, by considering the Hellinger distance H(P_i, P_j) [81] as a measure for the similarity between two probability distributions P_i and P_j. The corresponding indicator I(p) for phase transitions would then simply be given as I(p) ≡ H(P_{(U+∆U,ρ)}, P_{(U−∆U,ρ)}). In future studies, such extensions of the mean-based method should be investigated and compared based on the insights into the phase diagrams they provide and their computational cost.
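For discrete distributions, the Hellinger distance has the closed form H(P, Q) = (1/√2) ‖√P − √Q‖₂, so such an indicator is equally cheap once the distributions have been estimated; a sketch (our illustration):

```python
import numpy as np

def hellinger_indicator(P_left, P_right):
    """I(p) = H(P_(U-dU,rho), P_(U+dU,rho)) for empirical distributions
    given as normalized histograms over a common set of inputs."""
    return np.sqrt(0.5) * np.linalg.norm(np.sqrt(P_left) - np.sqrt(P_right))
```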
S6. COMPLEMENTARY FIGURES
This section contains complementary figures which depict the vector field arising in the prediction-based method (Fig. S5), the inferred two-dimensional phase diagram using the mean-based method with |F| (as opposed to correlation functions) as input (Fig. S6), and the analysis of the line-scan at ρ = 63/400 ≈ 0.16 for different types of inputs in the noise-free and noisy case (Fig. S7).

FIG. S5. Vector field δp = p̂ − p obtained using a DNN trained with |F| as input for the two-dimensional phase diagram of the FKM in (a) the noise-free and (b) the noisy case [see Fig. 2(b),(c) in the main text for the predicted phase diagrams, respectively]. The vector field exhibits a horizontal structure which demonstrates that ρ is predicted with near-perfect accuracy.

FIG. S6. Indicator ∆x̄ [Eq. (S20)] based on |F| as input on the two-dimensional parameter space of the FKM in (a) the noise-free and (b) the noisy case. This shows that the mean-based approach also works when using |F|, as opposed to κ, as input [cf. Fig. 2(e),(f) in the main text].

FIG. S7. Analysis of the transition from non-segregated to segregated orderings occurring along the line-scan from U_min = 1 to U_max = 8 at fixed ρ = 63/400 ≈ 0.16 [cf. dashed line in Fig. 2(c),(f) in the main text] in (a)–(f) the noisy and (g)–(l) the noise-free case. (a)-(f) Predictions Û and corresponding divergence ∂δU/∂U of (a),(d) a DNN and (b),(e) a linear model, as well as (c),(f) the indicator ∆x̄ [Eq. (S20)] based on κ. (g)-(l) Predictions Û and corresponding divergence ∂δU/∂U of (g),(j) a DNN and (h),(k) a linear model, as well as (i),(l) the indicator ∆x̄ [Eq. (S20)] based on |F|. The degree of red in (a)–(c) and (g)–(i) denotes an increasingly positive value of the respective indicator for phase transitions; (d)–(f) and (j)–(l) show ground-state configurations w visualized using the same color scale as for the points in (a)–(c) and (g)–(i), respectively. These results show that the construction of local order parameters using linear models can equally be carried out in the noise-free case and using the set of correlation functions κ as input.