Extrapolating quantum observables with machine learning: Inferring multiple phase transitions from properties of a single phase
Rodrigo A. Vargas-Hernández, John Sous, Mona Berciu, Roman V. Krems
Department of Chemistry, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z1
Department of Physics and Astronomy, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z1
Stewart Blusson Quantum Matter Institute, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z4
(Dated: April 26, 2019)

We present a machine-learning method for predicting sharp transitions in a Hamiltonian phase diagram by extrapolating the properties of quantum systems. The method is based on Gaussian Process regression with a combination of kernels chosen through an iterative procedure maximizing the predicting power of the kernels. The method is capable of extrapolating across the transition lines. The calculations within a given phase can be used to predict not only the closest sharp transition, but also a transition removed from the available data by a separate phase. This makes the present method particularly valuable for searching phase transitions in the parts of the parameter space that cannot be probed experimentally or theoretically.
It is very common in quantum physics to encounter a problem with the Hamiltonian

    H = H_0 + \alpha H_1 + \beta H_2,    (1)

whose eigenspectrum can be readily computed/measured in certain limits of \alpha and \beta, e.g. at \alpha = 0 or at \alpha \gg \beta, but not at arbitrary values of \alpha and \beta. For such problems, it is necessary to interpolate the properties of the quantum system between the known limits or extrapolate from a known limit. Both the interpolation and extrapolation become exceedingly complex if the system properties undergo sharp transitions at some values of \alpha and/or \beta. Such sharp transitions separate the phases of the Hamiltonian (1). Because the wave functions of the quantum system are drastically different in the different phases [1], an extrapolation of quantum properties across phase transition lines is generally considered unfeasible.

Here, we challenge this premise. We note that, while certain properties of quantum systems undergo a sharp change at a phase transition, other properties evolve smoothly through the transition. Using the example of three different lattice models, we show that the evolution of such properties within a given phase contains information about the transitions and the same properties beyond the transitions. We present a machine-learning method that can be trained by the evolution of such properties in a given phase to predict the sharp transitions and the properties of the quantum system in other phases by extrapolation. The importance of this result is clear. Characterizing quantum phase transitions embodied in model Hamiltonians is one of the foremost goals of quantum condensed-matter physics. Our work illustrates the possibility of predicting transitions at Hamiltonian parameters where obtaining the solutions of the Schrödinger equation may not be feasible.

The application of machine learning (ML) tools for solving problems in condensed-matter physics has recently become popular [2-34]. In all of these applications, ML is used as an efficient method to solve one of three problems: interpolation, classification or clustering. Interpolation amounts to fitting multi-dimensional functions or functionals, whereas classification and clustering are used to separate physical data by properties. For example, ML can be used to identify quantum phases of lattice spin Hamiltonians [5, 6, 12, 16, 19, 23, 24]. However, in order to identify a quantum phase transition by interpolation and/or classification, the aforementioned ML models must be trained (fed on input) by the data describing both phases on both sides of the transition. The distinct feature of the present work is a ML method that requires information from only one phase and extrapolates the properties of lattice models to and across the transitions. To illustrate the method, we consider four different problems: lattice polaron models with zero, one and two sharp transitions, and the mean-field Heisenberg model with a critical temperature. In all cases, we show that the phase transitions (or lack thereof) can be accurately identified.

We first consider a generalized lattice polaron model describing an electron in a one-dimensional lattice with N \to \infty sites coupled to a phonon field:

    H = \sum_k \epsilon_k c_k^\dagger c_k + \sum_q \omega_q b_q^\dagger b_q + V_{e-ph},    (2)

where c_k and b_q are the annihilation operators for the electron with momentum k and phonons with momentum q, \epsilon_k = 2t\cos(k) and \omega_q = \omega = const are the electron and phonon dispersions, and V_{e-ph} is the electron-phonon coupling.
We choose V_{e-ph} to be a combination of two qualitatively different terms, V_{e-ph} = \alpha H_1 + \beta H_2, where

    H_1 = \sum_{k,q} \frac{i}{\sqrt{N}} \left[\sin(k+q) - \sin(k)\right] c_{k+q}^\dagger c_k \left( b_{-q}^\dagger + b_q \right)    (3)

describes the Su-Schrieffer-Heeger (SSH) [35] electron-phonon coupling, and

    H_2 = \sum_{k,q} \frac{i}{\sqrt{N}} \sin(q)\, c_{k+q}^\dagger c_k \left( b_{-q}^\dagger + b_q \right)    (4)

is the breathing-mode model [36]. The ground state band of the model (2) represents polarons known to exhibit two sharp transitions as the ratio \alpha/\beta increases from zero to large values [37]. At \alpha = 0, the model (2) describes breathing-mode polarons, which have no sharp transitions [38]. At \beta = 0, the model (2) describes SSH polarons, which exhibit one sharp transition in the polaron phase diagram [35]. At these transitions, the ground state momentum and the effective mass of the polaron change abruptly.
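For concreteness, the momentum dependence of the two coupling vertices in Eqs. (3) and (4) can be tabulated directly. The short sketch below (our own illustration, not part of the original work) evaluates the SSH and breathing-mode vertices on a (k, q) grid, up to the overall i/\sqrt{N} prefactor:

```python
import numpy as np

def g_ssh(k, q):
    # momentum dependence of the SSH vertex in Eq. (3), up to the i/sqrt(N) prefactor
    return np.sin(k + q) - np.sin(k)

def g_bm(k, q):
    # momentum dependence of the breathing-mode vertex in Eq. (4); it depends on q only
    return np.sin(q)

k, q = np.meshgrid(np.linspace(-np.pi, np.pi, 201),
                   np.linspace(-np.pi, np.pi, 201), indexing="ij")
print(np.abs(g_ssh(k, q)).max(), np.abs(g_bm(k, q)).max())   # bounded by 2 and 1
```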
Method. We use Gaussian Process (GP) regression as the prediction method [39], described in detail in the Supplemental Material [40]. The goal of the prediction is to infer an unknown function f(\cdot) given n inputs x_i and outputs y_i. The assumption is that y_i = f(x_i). The function f is generally multidimensional, so x_i is a vector.

GPs do not infer a single function f(\cdot), but rather a distribution over functions, p(f | X, y), where X is a vector of all known x_i and y is a vector of the corresponding values y_i. This distribution is assumed to be normal. The joint Gaussian distribution of random variables f(x_i) is characterized by a mean \mu(x) and a covariance matrix K(\cdot, \cdot). The matrix elements of the covariance K_{i,j} are specified by a kernel function k(x_i, x_j) that quantifies the similarity relation between the properties of the system at two points x_i and x_j in the multi-dimensional space. Prediction at x_* is done by computing the conditional distribution of f(x_*) given y and X. The mean of the conditional distribution is [39]

    \mu(x_*) = \sum_{i=1}^{n} d(x_*, x_i)\, y_i = \sum_{i=1}^{n} \alpha_i\, k(x_*, x_i),    (5)

where \alpha = K^{-1} y and d = K(x_*, X)^\top K(X, X)^{-1}. The predicted mean \mu(x_*) can be viewed as a linear combination of the training data y_i or as a linear combination of the kernels connecting all training points x_i and the point x_*, where the prediction is made. In order to train a GP model, one must choose an analytical representation for the kernel function.

To solve the interpolation problem, one typically uses a simple form for the kernel. In the limit of large n, any simple kernel function produces accurate interpolation results [39]. For example, k can be approximated by any of the following functions:

    k_{LIN}(x_i, x_j) = x_i^\top x_j    (6)
    k_{RBF}(x_i, x_j) = \exp\left( -\tfrac{1}{2} r(x_i, x_j) \right)    (7)
    k_{MAT}(x_i, x_j) = \left( 1 + \sqrt{5\, r(x_i, x_j)} + \tfrac{5}{3} r(x_i, x_j) \right) \exp\left( -\sqrt{5\, r(x_i, x_j)} \right)    (8)
    k_{RQ}(x_i, x_j) = \left( 1 + \frac{|x_i - x_j|^2}{2 \alpha \ell^2} \right)^{-\alpha}    (9)

where r(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j) and M is a diagonal matrix with different length-scales \ell_d for each dimension of x_i. The length-scale parameters \ell_d, \ell and \alpha are the free parameters. We describe them collectively by \theta. A GP is trained by finding the estimate of \theta (denoted \hat{\theta}) that maximizes the logarithm of the marginal likelihood function:

    \log p(y | X, \theta, M_i) = -\tfrac{1}{2}\, y^\top K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log 2\pi.    (10)

For the extrapolation problem, the prediction produced by Eq. (5) is clearly sensitive to the particular choice of the kernel function. While different interpolation problems can be solved with the same mathematical form of the kernel function, different extrapolation problems generally require different kernels. The key for successful extrapolation is thus to find the appropriate kernel function. Because we aim to solve a variety of different problems with varying underlying physics, the procedure for constructing the kernel must be fully automated and independent of the particular problem under consideration.
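As a concrete illustration of this setup, the snippet below builds GP models with each of the base kernels (6)-(9) using scikit-learn. This is our own minimal sketch, not the authors' code: the random inputs are placeholders standing in for the quantum-mechanical training sets {x_i = (K, \omega, \alpha, \beta), y_i = E(K)} described below, and the hyperparameters \theta are set by maximizing the log marginal likelihood of Eq. (10).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic,
                                              DotProduct, ConstantKernel as C)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))                    # placeholders for x_i = (K, omega, alpha, beta)
y = np.sin(X @ np.array([3.0, 1.0, 2.0, 0.5]))    # placeholder for y_i = E(K)

# the four base kernels of Eqs. (6)-(9), each scaled by a constant kernel
kernels = {"LIN": C(1.0) * DotProduct(),
           "RBF": C(1.0) * RBF(length_scale=np.ones(4)),
           "MAT": C(1.0) * Matern(length_scale=np.ones(4), nu=2.5),
           "RQ":  C(1.0) * RationalQuadratic()}

for name, k in kernels.items():
    # theta is optimized by maximizing the log marginal likelihood, Eq. (10)
    gp = GaussianProcessRegressor(kernel=k, normalize_y=True,
                                  n_restarts_optimizer=2).fit(X, y)
    print(name, gp.log_marginal_likelihood_value_, len(gp.kernel_.theta))
```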
FIG. 1. Schematic diagram of the kernel construction method employed to develop a Gaussian Process model with extrapolation power. At each iteration, the kernel with the highest Bayesian information criterion (11) is selected. The labels in the boxes correspond to the kernel functions defined in Eqs. (6)-(9).
Here, we follow Refs. [47, 48] to build a prediction method based on a combination of products of different kernels (6)-(9).

FIG. 2. Extrapolation of the polaron ground state momentum K_{GS} (left) and effective mass m^* (right) across the sharp transition at \lambda_{SSH} = 2\alpha^2/t\hbar\omega \approx 0.6 for \beta = 0. The black solid curves are the accurate quantum calculations. The symbols are the predictions of the GP models trained by the full polaron dispersions E(K) at values of \lambda_{SSH} \le \lambda^*, where \lambda^* is shown by the vertical lines (solid for circles, dashed for triangles and dot-dashed for pentagons). The GP models are used for interpolation (open symbols) and extrapolation (full symbols). The algorithm of Figure 1 yields the kernel k_{RQ} \times k_{LIN} + k_{RBF} for the GP models represented by the triangles and pentagons, and k_{RQ} \times k_{LIN} \times k_{MAT} for the circles. Left inset: the polaron dispersions used as input (dashed curves) and predicted by the GP model (solid curves).

To select the best combination, we use the Bayesian information criterion (BIC) [49],

    BIC(M_i) = \log p(y | X, \hat{\theta}, M_i) - |M_i| \log n,    (11)

where |M_i| is the number of kernel parameters of kernel M_i. Here, p(y | X, \hat{\theta}, M_i) is the marginal likelihood for an optimized kernel \hat{\theta}. It is impossible to train and try models with all possible combinations of kernels. We use an iterative procedure schematically depicted in Figure 1. We begin by training a GP model with each of the kernels (6)-(9). These kernels have one (LIN), d (RBF and MAT) and two (RQ) free parameters [50]. The algorithm then selects the kernel - denoted k_0 - that leads to the model with the highest BIC and combines k_0(\cdot, \cdot) with each of the original kernels k_i defined by Eqs. (6)-(9). The kernels are combined as products k_0(\cdot, \cdot) \times k_i(\cdot, \cdot) and additions k_0(\cdot, \cdot) + k_i(\cdot, \cdot). Each kernel in the combination is scaled by a constant factor, which introduces another free parameter for the product or two parameters for the sum. For each of the possible combinations, a new GP model is constructed and a BIC is computed. The kernel yielding the highest BIC is then used as a new base kernel k_0 and the procedure is iterated. This fully automated algorithm is applied here to four different problems, yielding physical extrapolation results, thus showing that Eq. (11) can be used as a criterion for building prediction models capable of physical extrapolation.
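The iterative search of Figure 1 can be sketched as the greedy loop below. This is a schematic re-implementation under the stated BIC criterion, written with scikit-learn kernel algebra rather than the authors' code; the number of levels and the optimizer restarts are illustrative choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic,
                                              DotProduct, ConstantKernel as C)

def bic(gp, n):
    # Eq. (11): log marginal likelihood penalized by the number of kernel parameters
    return gp.log_marginal_likelihood_value_ - len(gp.kernel_.theta) * np.log(n)

def greedy_kernel_search(X, y, n_levels=3):
    base = [RBF(), Matern(nu=2.5), RationalQuadratic(), DotProduct()]
    candidates = [C(1.0) * k for k in base]          # level 0: scaled simple kernels
    best_kernel, best_bic = None, -np.inf
    for level in range(n_levels):
        scored = []
        for k in candidates:
            gp = GaussianProcessRegressor(kernel=k, normalize_y=True,
                                          n_restarts_optimizer=2).fit(X, y)
            scored.append((bic(gp, len(y)), gp.kernel_))
        level_bic, level_kernel = max(scored, key=lambda s: s[0])
        if level_bic > best_bic:
            best_bic, best_kernel = level_bic, level_kernel
        # next level: combine the current best kernel with every base kernel,
        # both as a product and as a (scaled) sum
        candidates = ([level_kernel * k for k in base] +
                      [level_kernel + C(1.0) * k for k in base])
    return best_kernel, best_bic
```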
Results. All GP models are trained by the dispersions E(K), where E is the polaron energy and K is the polaron momentum. These dispersions are calculated for infinite lattices using the Momentum Average (MA) approach from previous work [37, 51-55]. The models are trained to predict the polaron energy as a function of K and the Hamiltonian parameters \alpha, \beta, and \omega. The vectors x_i are thus x_i \Rightarrow {K, \omega, \alpha, \beta}, while f(\cdot) is the polaron energy. Once the models are trained, we numerically compute the ground state momentum K_{GS} and the polaron effective mass from the predicted dispersions [56]. Note that we always train all models by the polaron dispersions in one phase and the models have no a priori information about the existence of another phase(s). The transition is encoded in the evolution of the polaron band as a function of x. All results are in units of t.

Figure 2 shows the predictions for the pure SSH polaron model (\beta = 0, one sharp transition in the polaron phase diagram). The vertical lines show where the training points end and the extrapolation begins. As can be seen, the GP models predict accurately the location of the transition and can be used for quantitative extrapolation in a wide range of the Hamiltonian parameters to strong electron-phonon coupling. All models, including the ones trained by quantum calculations far removed from the transition point, predict accurately the location of the transition. As the coupling to phonons increases, the polaron develops a phonon-mediated next-nearest-neighbor hopping term: E(K) = -t\cos(K) + 2 t_2(\lambda_{SSH}) \cos(2K), where t_2(\lambda_{SSH}) is a function of \lambda_{SSH} [35]. The transition occurs when the second term dominates. Figure 2 shows that the GP models trained using the algorithm of Figure 1 extrapolate accurately this evolution of the polaron energy.

FIG. 3. The polaron ground state momentum K_{GS} for the mixed model (2) as a function of \beta/\alpha and \lambda_{SSH} = 2\alpha^2/t\hbar\omega. The color map is the prediction of the GP models. The curves are the quantum calculations from Ref. [37]. The models are trained by the polaron dispersions at the parameter values indicated by the white dots. No other information is used. The optimized kernel combination is (k_{MAT} + k_{RBF}) \times k_{LIN} (upper panel) and (k_{MAT} \times k_{LIN} + k_{RBF}) \times k_{LIN} (lower panel).

The power of this method is better illustrated with the example of the mixed breathing-mode - SSH model (\alpha \ne 0, \beta \ne 0) with three phases [37]. The dots in Figure 3 represent the points of the phase diagram used for training the GP model with the optimized kernels. Remarkably, the model trained by the polaron dispersions entirely in one phase predicts both transitions. The location of the first transition is predicted quantitatively. The second transition is predicted qualitatively. If the model is trained by the polaron properties in the two side phases and the prediction is made by extrapolation to low values of \lambda_{SSH} (lower panel of Figure 3), both transition lines are predicted quantitatively.

FIG. 4. GP prediction (solid curves) of the free energy density f(T, m) of the mean-field Heisenberg model produced by Eq. (12) (dashed curves). Inset: the order parameter m that minimizes f(T, m): symbols - GP predictions, dashed curve - from Eq. (12). The GP models are trained with 330 points in the temperature interval indicated by the shaded area and a restricted range of m.

As a third independent test, we applied the method to the Holstein polaron model defined by Eq. (2) with V_{e-ph} = const \sum_{k,q} c_{k+q}^\dagger c_k ( b_{-q}^\dagger + b_q ). Such a model is known to have no transitions [38]. We find that the method presented here can extrapolate accurately the polaron dispersions to a wide range of the Hamiltonian parameters and yields predictions that exhibit no sign of transitions. Since it is often not feasible to explore the entire phase diagram with rigorous quantum calculations, especially for models with many independent parameters, predicting the absence of transitions is as important as locating different phases.

Finally, we demonstrate the method on an analytically soluble model. We consider the Heisenberg model H = -J \sum_{\langle i,j \rangle} \vec{S}_i \cdot \vec{S}_j in the nearest-neighbor approximation. Employing a mean-field description, the resulting free energy density at temperature T is [1, 57, 58]

    f(T, m) \approx \frac{1}{2}\left( 1 - \frac{T_c}{T} \right) m^2 + \frac{1}{12}\left( \frac{T_c}{T} \right)^3 m^4,    (12)

where m is the magnetization and T_c = 1.25 is the critical temperature of the phase transition.
T > T_c corresponds to the paramagnetic phase, while T < T_c is the ferromagnetic phase. We train GP models by the results of Eq. (12) in the paramagnetic phase far away from T_c (shaded region in the inset of Figure 4). We then extrapolate the function f(T, m) across the critical temperature and compute the order parameter m which minimizes f(T, m). Figure 4 demonstrates that m thus predicted can be accurately extrapolated across T_c and far into a different phase. This demonstrates again the general idea behind the technique developed here: use ML to predict the evolution of continuous functions that encode phase transitions.

It is important to point out that the iterative kernel selection algorithm of Figure 1 must be analyzed before the present method is used for quantitative extrapolation. As the iterations continue, the kernels become more complex, more prone to overfitting and more difficult to optimize. The quantitative accuracy of the prediction may, therefore, decrease. The Supplemental Material illustrates the convergence of the predictions in Figures 2, 3 and 4 with the kernel optimization levels and also the increase of the prediction error after a certain number of levels. To prevent this problem, we stop the kernel optimization when the prediction error is minimal, as explained in the Supplemental Material. We emphasize that this does not affect the prediction of the transitions: once a certain level of Figure 1 is reached, kernels from all subsequent optimization levels predict the transitions. We have confirmed this for all the results (Figures 2, 3 and 4) presented here. Thus, if the goal is to predict the presence or absence of transitions, this method can be used without validation. It is sufficient to check that subsequent levels of the kernel optimization do not produce or eliminate transitions. In order to predict quantitatively the quantum properties by extrapolation, the training data must be divided into training and validation sets. The models must then be trained with the training set and the error calculated with the validation set. The kernel optimization must then be stopped when the error is minimal. This is a common approach to prevent the overfitting problem in ML with artificial neural networks.
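Returning to the Heisenberg example above, the sketch below trains a GP on samples of Eq. (12) taken only above T_c, extrapolates f(T, m) across the critical temperature and reads off the order parameter by minimizing over m. It is our own illustration of the workflow: the training window (1.8 \le T \le 2.1), the m grid and the kernel are assumptions, not the values or the kernel selected in the paper, and the quality of the extrapolation depends on the kernel that the search of Figure 1 would actually return.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, RationalQuadratic, DotProduct,
                                              ConstantKernel as C)

Tc = 1.25
f = lambda T, m: 0.5 * (1.0 - Tc / T) * m**2 + (Tc / T)**3 * m**4 / 12.0   # Eq. (12)

# training data: f(T, m) sampled only in the paramagnetic phase, well above Tc
T_train, m_train = np.linspace(1.8, 2.1, 15), np.linspace(-1.0, 1.0, 22)
TT, MM = np.meshgrid(T_train, m_train)
X = np.column_stack([TT.ravel(), MM.ravel()])
y = f(X[:, 0], X[:, 1])

kernel = C(1.0) * RationalQuadratic() * DotProduct() + C(1.0) * RBF([1.0, 1.0])
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                              n_restarts_optimizer=2).fit(X, y)

# extrapolate across Tc and read off the order parameter m(T) = argmin_m f(T, m)
m_grid = np.linspace(-1.5, 1.5, 601)
for T in (1.5, 1.25, 1.0, 0.8):
    f_pred = gp.predict(np.column_stack([np.full_like(m_grid, T), m_grid]))
    print(T, m_grid[np.argmin(f_pred)])
```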
Summary. We have presented a powerful method for predicting sharp transitions in Hamiltonian phase diagrams by extrapolating the properties of quantum systems. The method is based on Gaussian Process regression with a combination of kernels chosen through an iterative procedure maximizing the predicting power of the kernel. The model thus obtained captures the change of the quantum properties as the system approaches the transition, allowing the extrapolation of the physical properties, even across sharp transition lines. We believe that the present work is the first example of the application of ML for extrapolation of physical observables for quantum systems. We have demonstrated that the method is capable of using the properties of the quantum system within a given phase to predict not only the closest sharp transition, but also a transition removed from the training points by a separate phase. This makes the present method particularly valuable for searching phase transitions in the parts of the parameter space that cannot be probed experimentally or theoretically. Given that the training of the models and the predictions do not present any numerical difficulty [59], the present method can also be used to guide rigorous theory or experiments in the search for phase transitions. Finally, we must note that, although the present extrapolation method works well for all four problems considered, we cannot prove that it is accurate for an arbitrary system, so the predictions must always be validated, as is common in machine learning.

We acknowledge useful discussions with Dries Sels and Mathias Scheurer. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and by a visiting student fellowship at the Institute for Theoretical Atomic, Molecular, and Optical Physics (ITAMP) at Harvard University and the Smithsonian Astrophysical Observatory (J.S.).
[1] S. Sachdev, Quantum Phase Transitions, Cambridge University Press, Cambridge (1999).
[2] L.-F. Arsenault, A. Lopez-Bezanilla, O. A. von Lilienfeld, and A. J. Millis, Phys. Rev. B, 155136 (2014).
[3] L.-F. Arsenault, O. A. von Lilienfeld, and A. J. Millis, arXiv:1506.08858.
[4] T. Ohtsuki and T. Ohtsuki, J. Phys. Soc. Japan, 123706 (2016).
[5] L. Wang, Phys. Rev. B, 195105 (2016).
[6] J. Carrasquilla and R. G. Melko, Nat. Phys., 431 (2017).
[7] E. P. L. van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Nat. Phys., 435 (2017).
[8] P. Broecker, F. Assaad, and S. Trebst, arXiv:1707.00663.
[9] S. J. Wetzel and M. Scherzer, Phys. Rev. B, 184410 (2017).
[10] S. J. Wetzel, Phys. Rev. E, 022140 (2017).
[11] Y.-H. Liu and E. P. L. van Nieuwenburg, Phys. Rev. Lett., 176401 (2018).
[12] K. Chang, J. Carrasquilla, R. G. Melko, and E. Khatami, Phys. Rev. X, 031038 (2017).
[13] P. Broecker, J. Carrasquilla, R. G. Melko, and S. Trebst, Sci. Rep., 8823 (2017).
[14] F. Schindler, N. Regnault, and T. Neupert, Phys. Rev. B, 245134 (2017).
[15] M. J. Beach, A. Golubeva, and R. G. Melko, Phys. Rev. B, 045207 (2018).
[16] E. van Nieuwenburg, E. Bairey, and G. Refael, Phys. Rev. B, 060301(R) (2018).
[17] N. Yoshioka, Y. Akagi, and H. Katsura, Phys. Rev. B, 205110 (2018).
[18] J. Venderley, V. Khemani, and E.-A. Kim, Phys. Rev. Lett., 257204 (2018).
[19] G. Carleo and M. Troyer, Science, 602 (2017).
[20] M. Schmitt and M. Heyl, SciPost Phys., 013 (2018).
[21] Z. Cai and J. Liu, Phys. Rev. B, 035116 (2017).
[22] Y. Huang and J. E. Moore, arXiv:1701.06246.
[23] D.-L. Deng, X. Li, and S. D. Sarma, Phys. Rev. B, 195145 (2017).
[24] Y. Nomura, A. Darmawan, Y. Yamaji, and M. Imada, Phys. Rev. B, 205152 (2017).
[25] D.-L. Deng, X. Li, and S. D. Sarma, Phys. Rev. X, 021021 (2017).
[26] X. Gao and L.-M. Duan, Nat. Commun., 662 (2017).
[27] G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Nat. Phys., 447 (2018).
[28] T. Hazan and T. Jaakkola, arXiv:1508.05133.
[29] A. Daniely, R. Frostig, and Y. Singer, NIPS, 2253 (2016).
[30] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, Deep Neural Networks as Gaussian Processes, ICLR (2018).
[31] K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller, and E. K. U. Gross, Phys. Rev. B, 205118 (2014).
[32] L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Phys. Rev. Lett., 105503 (2015).
[33] F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Int. J. Quantum Chem., 1094 (2015).
[34] F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett., 135502 (2016).
[35] D. J. J. Marchand, G. De Filippis, V. Cataudella, M. Berciu, N. Nagaosa, N. V. Prokof'ev, A. S. Mishchenko, and P. C. E. Stamp, Phys. Rev. Lett., 266605 (2010).
[36] B. Lau, M. Berciu, and G. A. Sawatzky, Phys. Rev. B, 174305 (2007).
[37] F. Herrera, K. W. Madison, R. V. Krems, and M. Berciu, Phys. Rev. Lett., 223002 (2013).
[38] B. Gerlach and H. Löwen, Rev. Mod. Phys., 63 (1991).
[39] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge (2006).
[40] Supplemental Material "URL will be inserted by publisher" describes the details of the ML methods, the convergence and the quantum calculations used for training the ML models, and includes Refs. [41-46].
[41] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge (2016).
[42] M. Berciu, Phys. Rev. Lett., 036402 (2006).
[43] M. Berciu and G. L. Goodvin, Phys. Rev. B, 165109 (2007).
[44] G. L. Goodvin and M. Berciu, Phys. Rev. B, 235120 (2008).
[45] C. P. J. Adolphs and M. Berciu, Phys. Rev. B, 085149 (2014).
[46] M. Berciu and H. Fehske, Phys. Rev. B, 085116 (2010).
[47] D. K. Duvenaud, H. Nickisch, and C. E. Rasmussen, Advances in Neural Information Processing Systems, 226 (2011).
[48] D. K. Duvenaud, J. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani, Proceedings of the 30th International Conference on Machine Learning, 1166 (2013).
[49] G. Schwarz, The Annals of Statistics (2), 461 (1978).
[50] See Supplemental Material "URL will be inserted by publisher", Section I.A, for more details on combining kernels.
[51] M. Berciu, Phys. Rev. Lett., 036402 (2006).
[52] J. Sous, M. Chakraborty, C. P. J. Adolphs, R. V. Krems, and M. Berciu, Sci. Rep., 1169 (2017).
[53] J. Sous, M. Berciu, and R. V. Krems, Phys. Rev. A, 063619 (2017).
[54] J. Sous, M. Chakraborty, R. V. Krems, and M. Berciu, arXiv:1805.06109.
[55] See Supplemental Material "URL will be inserted by publisher", Section III, for more details of the MA approach.
[56] See Supplemental Material "URL will be inserted by publisher", Section II.A.
[57] P. M. Chaikin and T. C. Lubensky, Principles of Condensed Matter Physics, Cambridge University Press, Cambridge (1998).
[58] See Supplemental Material "URL will be inserted by publisher", Section III.B.
[59] See Supplemental Material "URL will be inserted by publisher", Section I.B, for the description of the computational scaling of the ML method adopted here.

Supplemental Material for "Extrapolating quantum observables with machine learning: Inferring multiple phase transitions from properties of a single phase"
Rodrigo A. Vargas-Hernández, John Sous, Mona Berciu, and Roman V. Krems
Department of Chemistry, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z1
Department of Physics and Astronomy, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z1
Stewart Blusson Quantum Matter Institute, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z4
(Dated: April 26, 2019)

The purpose of this Supplemental Material is to provide details of the numerical calculations we present in this work. Sections I and II discuss the machine-learning methods and Section III the quantum calculations used to train the ML models.
I. GP REGRESSION WITH KERNEL COMBINATIONS
Gaussian process (GP) regression is a kernel-based probabilistic non-parametric supervised ML algorithm [1]. Within the GP regression framework, the prediction is a normal distribution characterized by a mean \mu(\cdot) and a standard deviation \sigma(\cdot), given as

    \mu(x_*) = K(x_*, x)^\top \left[ K(x, x) + \sigma_n^2 I \right]^{-1} y,    (1)
    \sigma^2(x_*) = K(x_*, x_*) - K(x_*, x)^\top \left[ K(x, x) + \sigma_n^2 I \right]^{-1} K(x_*, x).    (2)

Here,
• x = (x_1, x_2, ..., x_n)^\top is a vector of n points in a multi-dimensional parameter space, where the GP model is trained;
• x_i is a vector of variable parameters. For the case of the polaron models considered here, x_i \Rightarrow {polaron momentum K, Hamiltonian parameter \alpha, Hamiltonian parameter \beta, phonon frequency \omega}. For the case of the Heisenberg model considered here, x_i \Rightarrow {temperature T, magnetization \tilde{m}};
• y = f(x) is a vector of quantum mechanics results at the values of the parameters specified by x_i. For the case of the polaron models considered here, y \Rightarrow polaron energy E. For the case of the Heisenberg model considered here, y \Rightarrow free energy density;
• x_* is a point in the parameter space where the prediction y_* is to be made;
• K(x, x) is the n \times n square matrix with the elements K_{i,j} = k(x_i, x_j) representing the covariances between y(x_i) and y(x_j). The elements k(x_i, x_j) are represented by the kernel function.

The GP models are constructed (in the language of ML, "trained") by the quantum mechanics results y at the parameters in x. The unknown in this model is the kernel function. The goal of the training is thus to find the best representation for the kernel function k(\cdot, \cdot).

In a standard procedure for training a GP model, one begins by assuming some simple analytical functional form for k(\cdot, \cdot). For example, one assumes one of the following functional forms:

    k_{LIN}(x_i, x_j) = x_i^\top x_j    (3)
    k_{RBF}(x_i, x_j) = \exp\left( -\tfrac{1}{2} r(x_i, x_j) \right)    (4)
    k_{MAT}(x_i, x_j) = \left( 1 + \sqrt{5\, r(x_i, x_j)} + \tfrac{5}{3} r(x_i, x_j) \right) \exp\left( -\sqrt{5\, r(x_i, x_j)} \right)    (5)
    k_{RQ}(x_i, x_j) = \left( 1 + \frac{|x_i - x_j|^2}{2 \alpha \ell^2} \right)^{-\alpha}    (6)

where r(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j) and M is a diagonal matrix with different length-scales \ell_d for each dimension of x_i. This list represents some of the most commonly used kernel functions.

The parameters of this analytical form are then found by maximizing the log marginal likelihood function,

    \log p(y | x, \theta) = -\tfrac{1}{2}\, y^\top K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log 2\pi,    (7)

where \theta denotes collectively the parameters of the analytical function for k(\cdot, \cdot) and |K| is the determinant of the matrix K.
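For reference, Eqs. (1), (2), (7) and (8) can be evaluated directly with a Cholesky factorization. The following sketch is our own illustration, with the RBF kernel of Eq. (4) hard-coded; it returns the posterior mean, the standard deviation, the log marginal likelihood and the BIC for given per-dimension length-scales.

```python
import numpy as np

def rbf_kernel(A, B, ell):
    # squared-exponential kernel of Eq. (4) with per-dimension length-scales ell
    r = (((A[:, None, :] - B[None, :, :]) / ell) ** 2).sum(-1)
    return np.exp(-0.5 * r)

def gp_fit_predict(X, y, Xs, ell, sigma_n=1e-6):
    n = len(y)
    K = rbf_kernel(X, X, ell) + sigma_n**2 * np.eye(n)      # K(x, x) + sigma_n^2 I
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # K^{-1} y
    Ks = rbf_kernel(X, Xs, ell)                             # K(x, x*)
    mu = Ks.T @ alpha                                       # Eq. (1)
    v = np.linalg.solve(L, Ks)
    var = 1.0 - (v ** 2).sum(0)                             # Eq. (2); k(x*, x*) = 1 for RBF
    logml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
             - 0.5 * n * np.log(2.0 * np.pi))               # Eq. (7)
    bic = logml - len(ell) * np.log(n)                      # Eq. (8) for this kernel
    return mu, np.sqrt(np.clip(var, 0.0, None)), logml, bic
```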
The marginal likelihood can also be used as a metric to compare different kernels. However, care must be taken when kernels with different numbers of parameters are to be compared. The second term of Eq. (7) directly depends on the number of parameters in the kernel, which makes the log marginal likelihood inappropriate for comparing kernels with different numbers of parameters. To overcome this issue, we compare the predictive power of different kernels using the Bayesian information criterion (BIC) [3],

    BIC(M_i) = \log p(y | x, \hat{\theta}, M_i) - |M_i| \log n,    (8)

where |M_i| is the number of kernel parameters of the kernel M_i. Here, p(y | x, \hat{\theta}, M_i) is the marginal likelihood for the optimized kernel parameters \hat{\theta} that maximize the logarithmic part. The last term in Eq. (8) acts to penalize kernels with a larger number of parameters to reduce overfitting, thus making the prediction model more robust.

A. Learning with kernel combinations
Typically, GP regression is used for interpolation of the training points y(x_i). For this problem, it is sufficient to choose one of the kernel functions above. The choice of the function will determine the efficiency of the interpolation model, i.e. the number n of training points required for accurate predictions between the training points. However, any of the kernel functions will work for interpolation.

As discussed in the main text, this is not the case for extrapolation. For accurate extrapolation, one needs to increase the complexity of kernels in order to capture the physical behaviour of the training data y(x_i) well. However, complex kernels come with a risk of overfitting. In addition, the ambiguity as to the choice of the kernel function increases with the complexity of the kernel function. So, the question is: how to increase the complexity of the kernel functions in a systematic way that prevents overfitting and results in a model that captures the physical behaviour of the training results?

Ideally, one should choose a kernel function that captures all of the physical behaviour of the training data. However, as we explained above, hand-crafting the 'best' kernels is not an easy task [1, 5]. In addition, hand-crafting kernels may introduce biases, limiting the generality of the prediction. In this work, we do not use any prior information for the selection of the kernels and we do not hand-craft kernels.

Here, we follow Refs. [2, 5] to increase the complexity of kernels by combining the simple functions (3)-(6) into products and sums. Combining different kernels can enhance the learning capacity of the GP regression [2]. For example, the first kernel combination that we describe here is the addition of two kernels, such as k_{MAT} + k_{MAT} or k_{RQ} + k_{MAT}. This new type of kernel is capable of learning long-range and short-range correlations between data points. Multiplication of kernels is another possible algebraic operation, for example, k_{RQ} \times k_{MAT}. Multiplying any of the kernels by the linear kernel, e.g. k_{RBF} \times k_{LIN}, leads to a GP regression that can learn increasing variations of the data. The dot-product/linear kernel, Eq. (3), can be used to construct polynomial kernels. For example, to describe quadratic functions, one could multiply this kernel by itself: k_{LIN} \times k_{LIN}.

It becomes clear that combining kernels in GP regression can provide an advantage in describing a variety of mathematical functions to accurately make predictions. This is the basis behind using GP regression with kernel combinations for extrapolating observables. To build more robust and flexible GP models, we employ the greedy search algorithm and the BIC to algorithmically construct the 'optimal' model. The greedy search is an 'optimal policy' algorithm [4] that selects the kernel assumed optimal based on the BIC at every step in the search. The underlying assumption is that the BIC represents the optimal measure of the kernel performance.

The number of free parameters for each of the simple kernels used in this work is
• k_{LIN}(x_i, x_j) \Rightarrow 1
• k_{RBF}(x_i, x_j) \Rightarrow d
• k_{MAT}(x_i, x_j) \Rightarrow d
• k_{RQ}(x_i, x_j) \Rightarrow 2
where d is the dimensionality (number of features) of the data. For example, for the results presented in Figure 4 of the main text, d = 2 and x_i = [T, m].

Every kernel considered in this work is scaled by the constant kernel, k_c(x_i, x_j) = const.
The total number of parameters of a GP model with any simple kernel considered in this work is thus increased by one due to k_c(x_i, x_j) \times k_X(x_i, x_j), where k_X is any of the kernels listed above.

As the algorithm depicted in Figure 1 progresses to lower levels, the number of free kernel parameters increases and the kernels become rather complex. We express such kernels as a sum of products of kernels by distributing all products over sums. For example, the kernel used to construct Figure 3 (lower panel) of the main text is

    (k_{MAT} \times k_{LIN} + k_{RBF}) \times k_{LIN} = k_{MAT} \times k_{LIN} \times k_{LIN} + k_{RBF} \times k_{LIN},

which, including the constant kernels, is

    ( k_c \times k_{MAT} \times k_{LIN} \times k_{LIN} ) + ( k_c \times k_{RBF} \times k_{LIN} ).
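In scikit-learn, such composite kernels can be written down with the same algebra. The sketch below is an illustration (not the code used in the paper): it constructs the kernel quoted above for Figure 3 (lower panel) and counts its free hyperparameters, assuming d = 4 features (K, \omega, \alpha, \beta) as for the polaron data.

```python
from sklearn.gaussian_process.kernels import RBF, Matern, DotProduct, ConstantKernel as C

d = 4                                     # x_i = (K, omega, alpha, beta) for the polaron data
k_lin = DotProduct()
k_rbf = RBF(length_scale=[1.0] * d)
k_mat = Matern(length_scale=[1.0] * d, nu=2.5)

# the kernel quoted above for Figure 3 (lower panel): (k_MAT x k_LIN + k_RBF) x k_LIN,
# with each term in the sum scaled by a constant kernel
kernel = (C(1.0) * k_mat * k_lin + C(1.0) * k_rbf) * k_lin
print(kernel)                             # symbolic form of the composite kernel
print(kernel.theta.size)                  # number of free (log-transformed) hyperparameters
```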
B. Numerical difficulty of training and using the GP models

In order to train a GP model, one needs to maximize the log-likelihood function in Eq. (7) by iteratively computing the inverse and the determinant of the correlation matrix K. The dimension of this matrix is n \times n, where n is the number of training points. In this work, n is at most of the order of one thousand (see Section II), so the matrix operations required for the training are not computationally demanding. Once the model is trained, the inverse of K is, at this point, known, and the prediction of Eq. (1) may actually be reduced to a scalar product of two vectors of size n. The numerical evaluation of these products presents no computational difficulty.
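A back-of-the-envelope sketch of this scaling (our own illustration, with an unoptimized RBF kernel and synthetic data): the Cholesky factorization of the n \times n covariance matrix dominates the training cost, while each prediction of Eq. (1) is a length-n dot product.

```python
import numpy as np

n, d = 900, 4                                   # training-set size used for Fig. 3 (upper panel)
X, y = np.random.rand(n, d), np.random.rand(n)  # synthetic stand-in for the training data

# training: build and factorize the n x n covariance matrix (O(n^3), repeated for every
# likelihood evaluation during the optimization of the kernel parameters)
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # K^{-1} y, stored after training

# prediction at a new point x*: an O(n) scalar product, Eq. (1) of this Supplement
x_star = np.random.rand(d)
k_star = np.exp(-0.5 * ((X - x_star) ** 2).sum(-1))
mu = k_star @ alpha
```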
II. SPECIFIC DETAILS OF THE EXTRAPOLATION METHOD

The value of a quantum observable depends on the parameters of the Hamiltonian. One can learn the behavior of quantum observables with different ML models using the following relation:

    E = \langle \hat{H}(K, \alpha, \beta, ...) \rangle \sim F(K, \alpha, \beta, ...),    (9)

where F(\cdot) is any ML model that can learn the dependence between the Hamiltonian parameters and the quantum observable. In the present work, the quantum observables are the polaron ground state energy and the free-energy density, denoted as E(\cdot). We use the algorithm proposed above to learn F(\cdot) and hence to extrapolate quantum observables.

The results of Figure 2 of the main text are for the polaron model with \beta = 0. This figure presents the extrapolation with three different ML models, represented by circles, triangles and pentagons:
• the GP model represented by the triangles is trained with 210 points distributed in the ranges 0 \le K \le \pi and 0 \le \lambda_{SSH} \le \lambda^*;
• a second GP model is trained with 245 points distributed in the ranges 0 \le K \le \pi and 0 \le \lambda_{SSH} \le \lambda^*;
• a third GP model is trained with 175 points distributed in the ranges 0 \le K \le \pi and 0 \le \lambda_{SSH} \le \lambda^*.
Here \lambda^* denotes the largest coupling included in the training data, shown by the vertical lines in Figure 2 of the main text.

The results of Figure 3 of the main text are for the mixed model with \alpha \ne 0 and \beta \ne 0. This figure presents the extrapolation with the ML models trained by the quantum calculations at the Hamiltonian parameters shown by the white circles in Figure 3 of the main text. For each training point (each white circle), we use 16 energy points in the range 0 \le K \le \pi, for a total of 900 training points for the upper panel of Figure 3 and 960 training points for the lower panel of Figure 3.

A. Effective mass and ground state momentum from extrapolated results
Given the GP-extrapolated E(K), we compute K_{GS} and m^* as follows. K_{GS} is the value of the momentum that minimizes E(K):

    K_{GS}(\alpha, \beta, \cdots) = \arg\min_K E(K, \alpha, \beta, \cdots),    (10)

which depends on the Hamiltonian parameters \alpha and \beta. For all results presented here, we compute K_{GS} by searching for the value where E(K) is minimal. This procedure is depicted in Figure SM 1.

The polaron effective mass is

    m^*(\lambda_{SSH}) = \left[ \frac{\partial^2 E_K(\lambda_{SSH})}{\partial K^2} \right]^{-1} \Bigg|_{K = K_{GS}}.    (11)

To compute m^*, we numerically evaluate the second derivative of the extrapolated E(K).
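A minimal numerical implementation of Eqs. (10) and (11) (our own sketch; `gp` stands for any trained model, and the grid density and the fixed-parameter names are illustrative assumptions):

```python
import numpy as np

def kgs_and_mass(K, E):
    """Eqs. (10) and (11): ground-state momentum and effective mass from a dispersion
    E(K) sampled on a dense, uniform grid K (assumes the minimum lies away from the edges)."""
    i0 = int(np.clip(np.argmin(E), 1, len(K) - 2))
    dK = K[1] - K[0]
    d2E = (E[i0 + 1] - 2.0 * E[i0] + E[i0 - 1]) / dK**2   # second derivative at K_GS
    return K[i0], 1.0 / d2E                                # m* = [d^2 E / dK^2]^{-1}

# usage with a trained GP model `gp` at fixed (omega, alpha, beta) -- hypothetical names:
# K = np.linspace(0.0, np.pi, 500)
# E = gp.predict(np.column_stack([K, np.full_like(K, omega),
#                                 np.full_like(K, alpha), np.full_like(K, beta)]))
# K_GS, m_eff = kgs_and_mass(K, E)
```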
Fig. SM 1: SSH polaron dispersions predicted (dashed curves) by a GP model trained on the quantum calculations (black solid curves) from Ref. [6], resulting in the kernel k_{RQ} \times k_{LIN} + k_{RBF}. The red crosses indicate the positions where the polaron dispersion reaches its minimum. Inset: the value of K_{GS} as a function of \lambda_{SSH}.

B. Prediction accuracy convergence (number of training points)
Figure SM 2 illustrates how the accuracy of the extrapolation improves with the number of training points. We consider the pure SSH polaron model with one sharp transition. The GP models are trained by quantum results at \lambda_{SSH} \le 0.4, which is far below the transition point \lambda_{SSH} \approx 0.6, and used to predict the polaron properties after the transition, at \lambda_{SSH} > 0.6. In all the cases presented, the kernel search algorithm depicted in Figure 1 of the main manuscript is run for three depth levels.

All models are trained by the quantum results at 5 values of \lambda_{SSH} \le 0.4, but with a different number of points at 0 \le K \le \pi: 15 (triangles), 25 (squares), 35 (circles). Figure SM 2 clearly shows that the accuracy of the prediction dramatically improves with the number of training points.
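This convergence test can be reproduced schematically with the loop below. The dispersion function is a toy stand-in (not the MA data), and the kernel, coupling values and grid sizes are illustrative assumptions chosen only to mimic the 15/25/35-point training sets; whether the toy GP actually captures the shift of the band minimum depends on the kernel, and the point here is the workflow.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, RationalQuadratic, DotProduct,
                                              ConstantKernel as C)

def toy_dispersion(K, lam):
    # toy band whose minimum moves away from K = 0 as lam grows (NOT the MA data)
    return -2.0 * np.cos(K) + 4.0 * lam**2 * np.cos(2.0 * K)

lams_train = np.linspace(0.1, 0.4, 5)                 # five couplings below the transition
kernel = C(1.0) * RationalQuadratic() * DotProduct() + C(1.0) * RBF([1.0, 1.0])
K_dense = np.linspace(0.0, np.pi, 400)

for n_k in (15, 25, 35):                              # points per coupling, as in Fig. SM 2
    K = np.linspace(0.0, np.pi, n_k)
    X = np.array([[k, lam] for lam in lams_train for k in K])
    y = toy_dispersion(X[:, 0], X[:, 1])
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                  n_restarts_optimizer=2).fit(X, y)
    E_pred = gp.predict(np.column_stack([K_dense, np.full_like(K_dense, 0.8)]))
    print(n_k, "K points per coupling -> K_GS at lam = 0.8:",
          round(K_dense[np.argmin(E_pred)], 3))
```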
C. Prediction accuracy convergence (kernel complexity dependence)

For clarity, here we use the notation "GPL-X" for the kernel with the highest BIC obtained after X depth levels of the algorithm depicted in Figure 1 of the main manuscript. "X" thus denotes the depth of the kernel optimization diagram shown in Figure 1 of the main manuscript. Figures SM 3 and SM 4 show how the accuracy of the prediction of the sharp transitions shown in Figure 3 of the main text improves as the kernel complexity increases.
Fig. SM 2: Ground state momentum K_{GS} for the predicted SSH polaron dispersions with a GP model trained at \lambda_{SSH} \le 0.4. Triangles - 15 points per value of \lambda_{SSH} (75 points total); orange squares - 25 points per value of \lambda_{SSH} (125 points total); green circles - 35 points per value of \lambda_{SSH} (175 points total). The black solid curve is the rigorous result from Ref. [6].
Fig. SM 3: Improvement of the phase diagram shown in Figure 3 (upper panel) of the main manuscript with the kernel complexity increasing as determined by the algorithm depicted in Figure 1 of the main manuscript. The panels correspond to the optimized kernels GPL-0 (left), GPL-1 (center), GPL-2 (right), where "GPL-X" denotes the optimal kernel obtained after X depth levels.

Fig. SM 4: Improvement of the phase diagram shown in Figure 3 (lower panel) of the main manuscript with the kernel complexity increasing as determined by the algorithm depicted in Figure 1 of the main manuscript. The panels correspond to the optimized kernels GPL-0 (upper left), GPL-1 (upper right), GPL-2 (lower left), GPL-3 (lower right), where "GPL-X" denotes the optimal kernel obtained after X depth levels.

For all of the calculations presented, we verified that increasing X (the number of levels in the kernel optimization) does not change the predictions of the phase transitions. This applies to all results in Figures 2, 3 and 4 in the main manuscript as well as the Holstein model results discussed in the main text. Once a phase transition (or the absence of transitions) is identified, the prediction of the phase transition (or the absence of transitions) is reliable. In other words, once a certain level of kernel optimization is reached, all kernels from the subsequent optimization levels predict the phase transitions or the absence of the phase transitions correctly.

However, as the complexity of the kernels increases with each new level X, it becomes more difficult to find the optimal kernel within a given level X. The optimization algorithm is more likely to get stuck in a local minimum. This does not affect the predictions of the phase transitions. However, the quantitative predictions of the quantum properties in the extrapolated phase may be affected. Both of these points are illustrated in Figure SM 5 (upper right panel). To prevent this problem, in the present work, we stop the optimization algorithm after three levels of optimization for the results in Figures 2, 3 (upper panel) and 4. For the results in Figure 3 (lower panel), we stop the optimization after four levels.

These results show that, if the goal is to predict the presence or absence of phase transitions, the method described here can be used without validation. It is sufficient to ensure that subsequent levels of the kernel optimization do not produce or eliminate phase transitions. If the goal is to predict quantitatively the quantum properties by extrapolation, the training data must be divided into training and validation sets. The models must then be trained with the training set and the error calculated with the validation set. The kernel optimization must then be stopped at the level of the diagram in Figure 1 where the error is the smallest. This is one of the approaches to prevent the overfitting problem in machine learning with artificial neural networks.

III. QUANTUM CALCULATIONS TO OBTAIN TRAINING DATA

A. Polaron models
For the polaron models, we use the Momentum Average (MA) approximation, yielding accurate results for the polaron energies in one-dimensional lattices of infinite size [7-9].

The MA approach is a non-perturbative quasi-analytical technique designed to solve the equation of motion for the Green's function \hat{G}(k, \omega) = \langle k | (\omega - \hat{H} + i\eta)^{-1} | k \rangle. We use the Dyson identity \hat{G}(\omega) = \hat{G}_0(\omega) + \hat{G}(\omega) \hat{V} \hat{G}_0(\omega) to generate the hierarchy of equations of motion, with \hat{G}(\omega) = (\omega - \hat{H} + i\eta)^{-1}, \hat{G}_0(\omega) = (\omega - \hat{H}_0 + i\eta)^{-1}, \hat{H}_0 = \hat{H} - \hat{V}_{e-ph}, and \hat{V} = \hat{V}_{e-ph} the electron-phonon coupling term. This hierarchy consists of an infinite set of coupled equations, making an exact solution impossible.

The MA approach acts to guide an insightful approximation/truncation of the hierarchy, allowing for efficient yet accurate computations by neglecting the exponentially small diagrams in the expansion. The set of diagrams retained in the hierarchy is identified by considering the variational meaning of MA: one allows for boson excitations only within a finite spatial cut-off from the electron in the polaron cloud [8].

This choice of the variational space depends on the details of the Hamiltonian and the states of interest [8]. For the Holstein model, a one-site phonon cloud suffices to provide accurate results for single polarons [7, 8] and for tightly bound bipolarons [10]. For the SSH model, the coupling to phonons is non-local and therefore a bigger cloud is required to yield accurate results. A three-site phonon cloud MA has been shown to be very accurate for such models [6, 12-15].

By design, the MA approach computes the properties of polarons in infinite lattices by utilizing the momentum-space representation. Therefore, finite-size effects have no relevance.
The MA data used in this work are of the three-site variational flavor and have been confirmed to be in quantitative agreement with numerically exact results. The SSH polaron results were verified against the Bold Diagrammatic Quantum Monte Carlo results in Ref. [6], whereas more complicated extensions have been verified against Variational Exact Diagonalization in Refs. [13-15]. The polaron energy is obtained from the lowest discrete peak in the imaginary part of the Green's function.

Fig. SM 5: Upper left: Schematic diagram of the kernel combinations with the highest BIC. Upper right: Effect of the increasing kernel complexity on the extrapolation accuracy. "GPL-X" denotes the results with the kernel obtained after X depth levels depicted in the upper left panel (e.g. GPL-0 is k_{RQ} and GPL-1 is k_{RQ} \times k_{LIN} + k_{RBF}). Lower panels: polaron dispersions predicted by the GP model with the kernel k_{RQ} (left panel) and the kernel k_{RQ} \times k_{LIN} + k_{RBF} obtained at the GPL-2 level (right panel). The dashed curves show the GP model predictions, while the solid curves are the results from Ref. [6]. The GP models are trained by the quantum results at \lambda_{SSH} \le 0.5.

B. Mean-field free energy of the Heisenberg model
Here we present the derivation of the dimensionless mean-field free energy density of the Heisenberg model that we study with the GP method to predict the transition from the ferromagnetic to the paramagnetic phase. The Heisenberg model Hamiltonian reads

    H = -J \sum_{\langle i,j \rangle} \bar{S}_i \cdot \bar{S}_j,    (12)

where \langle i, j \rangle accounts only for nearest-neighbour interactions between different spins \bar{S}_i. The free energy of the Heisenberg model in the mean-field approximation is a function of the magnetization m and the temperature T [16],

    F(m, T) = \frac{J z N m^2}{2 (g\mu_B)^2} - N T \ln\left[ 2 \cosh\left( \frac{J z m}{T g\mu_B} \right) \right],    (13)

where m is defined as m = g\mu_B \langle \bar{S}_i \rangle and z is the coordination number. The Boltzmann constant is set to 1 throughout this section.

Taylor expanding F(m, T) near the transition, where m is vanishingly small, we obtain

    F(m, T) = \frac{J z N m^2}{2 (g\mu_B)^2} - N T \left[ \ln 2 + \frac{1}{2}\left( \frac{J z m}{T g\mu_B} \right)^2 - \frac{1}{12}\left( \frac{J z m}{T g\mu_B} \right)^4 + \cdots \right].    (14)

To find the critical transition temperature T_c, we minimize F(m, T): \partial F / \partial m = 0. Solving graphically, we obtain T_c = Jz [16]. We then divide F(m, T) by N and T_c after subtracting F(0, T) to obtain the shifted free energy density

    f(m, T) = \frac{1}{2 (g\mu_B)^2}\left[ 1 - \frac{T_c}{T} \right] m^2 + \frac{1}{12 (g\mu_B)^4}\left( \frac{T_c}{T} \right)^3 m^4.    (15)

The last step is to define the dimensionless magnetization \tilde{m} = m / (g\mu_B), yielding

    f(\tilde{m}, T) = \frac{1}{2}\left[ 1 - \frac{T_c}{T} \right] \tilde{m}^2 + \frac{1}{12}\left( \frac{T_c}{T} \right)^3 \tilde{m}^4.    (16)

This is Eq. (12) in the main text, where the tilde over m has been omitted to simplify the notation.

The shape of the magnetization dependence of the free energy density changes with T. For T_c/T < 1, the minimizer of f(\tilde{m}, T), i.e. the order parameter (here denoted as m_0), is m_0 = 0, while for T_c/T > 1 it is m_0 \ne 0. This is illustrated in Figure 4 of the main text, where T_c = 1.25. This form of f(\tilde{m}, T) can be equivalently obtained through the phenomenological Landau theory of phase transitions [16]. We use GP regression with kernel combinations to extrapolate the free energy density across the phase transition.
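Setting \partial f / \partial \tilde{m} = 0 in Eq. (16) gives the minimizer in closed form: \tilde{m} = 0 for T \ge T_c and \tilde{m}^2 = 3 (T_c/T - 1)(T/T_c)^3 for T < T_c. The short check below (our own, with T_c = 1.25 as in the main text) compares this expression with a brute-force minimization of Eq. (16):

```python
import numpy as np

Tc = 1.25

def f(m, T):
    # Eq. (16): shifted mean-field free energy density
    return 0.5 * (1.0 - Tc / T) * m**2 + (Tc / T)**3 * m**4 / 12.0

def order_parameter(T):
    # from d f/dm = 0: m = 0 for T >= Tc, m^2 = 3 (Tc/T - 1)(T/Tc)^3 for T < Tc
    return 0.0 if T >= Tc else np.sqrt(3.0 * (Tc / T - 1.0) * (T / Tc)**3)

m_grid = np.linspace(0.0, 2.0, 200001)
for T in (1.5, 1.25, 1.1, 0.9):
    m_numeric = m_grid[np.argmin(f(m_grid, T))]     # brute-force minimization of Eq. (16)
    print(T, order_parameter(T), m_numeric)
```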
[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge (2006).
[2] D. K. Duvenaud, H. Nickisch, and C. E. Rasmussen, Advances in Neural Information Processing Systems, 226 (2011).
[3] G. Schwarz, The Annals of Statistics (2), 461 (1978).
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge (2016).
[5] D. K. Duvenaud, J. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani, Proceedings of the 30th International Conference on Machine Learning, 1166 (2013).
[6] D. J. J. Marchand, G. De Filippis, V. Cataudella, M. Berciu, N. Nagaosa, N. V. Prokof'ev, A. S. Mishchenko, and P. C. E. Stamp, Phys. Rev. Lett., 266605 (2010).
[7] M. Berciu, Phys. Rev. Lett., 036402 (2006).
[8] M. Berciu and G. L. Goodvin, Phys. Rev. B, 165109 (2007).
[9] G. L. Goodvin and M. Berciu, Phys. Rev. B, 235120 (2008).
[10] C. P. J. Adolphs and M. Berciu, Phys. Rev. B, 085149 (2014).
[11] F. Herrera, K. W. Madison, R. V. Krems, and M. Berciu, Phys. Rev. Lett. (22), 223002 (2013).
[12] M. Berciu and H. Fehske, Phys. Rev. B, 085116 (2010).