Machine learning in physics: The pitfalls of poisoned training sets
Chao Fang, Amin Barzegar, and Helmut G. Katzgraber
Department of Physics and Astronomy, Texas A&M University, College Station, Texas 77843-4242, USA
Microsoft Quantum, Microsoft, Redmond, WA 98052, USA
Known for their ability to identify hidden patterns in data, artificial neural networks are among the most powerful machine learning tools. Most notably, neural networks have played a central role in identifying states of matter and phase transitions across condensed matter physics. To date, most studies have focused on systems where different phases of matter and their phase transitions are known, and thus the performance of neural networks is well controlled. While neural networks present an exciting new tool to detect new phases of matter, here we demonstrate that when the training sets are poisoned (i.e., poor training data or mislabeled data) it is easy for neural networks to make misleading predictions.
PACS numbers: 75.50.Lk, 75.40.Mg, 05.50.+q, 64.60.-i
I. INTRODUCTION
Machine learning methods [1–3] have found applications in condensed matter physics detecting phases of matter and transitions between these in both quantum and classical systems (see, for example, Refs. [4–9]). Different approaches exist, such as lasso [10, 11], sparse regression [12, 13], classification and regression trees [14–16], as well as boosting and support vector machines [17–21]. Neural networks [22, 23] are the most versatile and powerful tools, which is why they are commonly used in scientific applications.
Convolutional neural networks (CNNs), in particular, are specialized neural networks for processing data with a grid-like topology. Familiar examples include time-series data, where samples are taken in intervals, and images (two-dimensional data sets). The primary difference between neural networks and convolutional neural networks lies in how hidden layers are managed. In CNNs, a convolution is applied to divide the feature space into smaller sections, emphasizing local trends. Because of this, CNNs are ideally suited to study physical models on hypercubic lattices. Recently, it was demonstrated that CNNs can be applied to the detection of phase transitions in Edwards-Anderson Ising spin glasses on cubic lattices [24]. It was shown that the critical behavior of a spin glass with bimodal disorder can be inferred by training the model using data that has Gaussian interactions between the spins. The use of CNNs also results in a reduced numerical effort, which means one could potentially access larger system sizes often needed to overcome corrections to scaling in numerical studies. As such, pairing specialized hardware to simulate Ising systems [25–27] with machine learning techniques might one day elucidate properties of spin glasses and related systems. However, as we show in this work, the use of poor input data can result in erroneous or even unphysical results.
This (here inadvertent) poisoning of the training set is well known in computer science, where small amounts of bad data can strongly affect the accuracy of neural network systems. For example, Steinhardt et al. [28] demonstrated that already small amounts of bad data can result in a sizable drop in classification accuracy. References [29–31] furthermore demonstrate that data poisoning can have a strong effect in machine learning. Reference [32] focuses on adversarial manipulations [33, 34] of simulational and experimental data in condensed matter physics applications. In particular, they show that changing individual variables (e.g., a pixel in a data set) can generate misleading predictions. This suggests that results from machine learning algorithms sensitively rely on the quality of the training input.
In this work, we demonstrate that the use of poorly-thermalized Monte Carlo data or simply mislabeled data can result in erroneous estimates of the critical temperatures of Ising spin-glass systems. As such, we focus less on adversarial cases and more on accidental cases of poor data preparation. We train a CNN with data from a Gaussian Ising spin glass in three space dimensions and then use data generated for a bimodal Ising spin glass to predict the transition temperature of the same model system, albeit with different disorder. In addition, going beyond the work presented in Ref. [32], we introduce an analysis pipeline that allows for the precise determination of the critical temperature. While good data result in a relatively accurate prediction, the use of poorly-thermalized or mislabeled data produces misleading results. This should serve as a cautionary tale when using machine learning techniques for physics applications.
The paper is structured as follows. In Sec. II we introduce the model used in the study, as well as simulation parameters for both training and prediction data.
In addition, we outline the implementation of the CNN as well as the approach used to extract the thermodynamic critical temperature, followed by results and concluding remarks.
II. MODEL AND NUMERICAL DETAILS
To illustrate the effects of poisoned training sets we study the three-dimensional Edwards-Anderson Ising spin glass [35–39] with a neural network implemented in TensorFlow [40]. The model is described by the Hamiltonian

H = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j ,    (1)

where each J_{ij} is a random variable drawn from a given symmetric probability distribution, either bimodal, i.e., ±1 with equal probability, or Gaussian with zero mean and unit variance. In addition, s_i = ±1 represent Ising spins, and the sum is over nearest neighbors on a cubic lattice with N sites.
Because spin glasses do not exhibit spatial order below the spin-glass transition, we measure the site-dependent spin overlap [41–43]

q_i = s_i^{\alpha} s_i^{\beta} ,    (2)

between replicas α and β. In the overlap space, the system is reminiscent of an Ising ferromagnet, i.e., approaches for ferromagnetic systems introduced in Refs. [6, 7] can be used. For low temperatures, q = (1/N) \sum_i q_i → 1, whereas for T → ∞, q → 0. For an infinite system, q abruptly drops to zero at the critical temperature T_c. Therefore, the overlap space is well suited to detect the existence of a phase transition in a disordered system, even beyond spin glasses. In the overlap space, the spin-glass phase transition can be visually seen as the formation of disjoint islands with identical spin configurations. As such, the problem of phase identification in physical systems is reminiscent of an image classification problem, where CNNs are shown to be highly efficient compared to fully-connected neural networks (FCNs).

A. Data generation
We use parallel tempering Monte Carlo [44] to generate configurational overlaps. Details about the parameters used in the Monte Carlo simulations are listed in Tab. I for the training data with Gaussian disorder. The parameters for the prediction data with bimodal disorder are listed in Tab. II.
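The parallel tempering scheme of Ref. [44] can be sketched in a few lines of Python (numpy). This is a minimal illustration, not the production code: the lattice size, temperature grid, sweep count, and random couplings below are placeholders, not the values of Tabs. I and II.

```python
import numpy as np

def ea_energy(s, J):
    """Energy of a 3D Edwards-Anderson configuration with periodic
    boundaries; J[a] holds the couplings along lattice axis a."""
    return -sum(np.sum(J[a] * s * np.roll(s, -1, axis=a)) for a in range(3))

def parallel_tempering(L=4, temps=(0.8, 1.0, 1.2, 1.5), sweeps=200, rng=None):
    """Minimal parallel tempering for one disorder sample: one replica
    per temperature, Metropolis sweeps plus replica-exchange moves."""
    rng = np.random.default_rng(rng)
    J = [rng.normal(size=(L, L, L)) for _ in range(3)]   # Gaussian disorder
    betas = [1.0 / T for T in temps]
    reps = [rng.choice([-1, 1], size=(L, L, L)) for _ in temps]
    shifts = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    for _ in range(sweeps):
        for s, beta in zip(reps, betas):
            for _ in range(L ** 3):                      # one Metropolis sweep
                i, j, k = rng.integers(0, L, size=3)
                h = 0.0                                  # local field at (i, j, k)
                for a, (di, dj, dk) in enumerate(shifts):
                    h += J[a][i, j, k] * s[(i + di) % L, (j + dj) % L, (k + dk) % L]
                    h += (J[a][(i - di) % L, (j - dj) % L, (k - dk) % L]
                          * s[(i - di) % L, (j - dj) % L, (k - dk) % L])
                dE = 2.0 * s[i, j, k] * h                # cost of flipping the spin
                if dE <= 0.0 or rng.random() < np.exp(-beta * dE):
                    s[i, j, k] *= -1
        for r in range(len(temps) - 1):                  # replica-exchange step
            delta = (betas[r] - betas[r + 1]) * (
                ea_energy(reps[r], J) - ea_energy(reps[r + 1], J))
            if delta >= 0.0 or rng.random() < np.exp(delta):
                reps[r], reps[r + 1] = reps[r + 1], reps[r]
    return reps, J
```

Running two such simulations with the same disorder but independent spin initializations yields the replica pairs from which the overlaps of Eq. (2) are built.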
TABLE I: Parameters for the training samples with Gaussian disorder. L is the linear size of a system with N = L^3 spins, N_sa is the number of samples, N_sw is the number of Monte Carlo sweeps for each of the replicas for a single sample, T_min and T_max are the lowest and highest temperatures simulated, N_T is the number of temperatures used in the parallel tempering Monte Carlo method for each system size L, and N_con is the number of configurational overlaps for a given temperature in each instance.

L    N_sa    N_sw      T_min  T_max  N_T  N_con
 8   …       …         0.80   1.21   20   100
10   10000   40000     0.80   1.21   20   100
12   20000   655360    0.80   1.21   20   100
14   10000   1050000   0.80   1.21   20   100
16    5000   1050000   0.80   1.21   20   100
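The configurational overlap of Eq. (2), which serves as the grid-like CNN input, is straightforward to compute from two replica configurations. A minimal numpy sketch (the random ±1 configurations here merely stand in for equilibrated Monte Carlo replicas):

```python
import numpy as np

def site_overlap(s_alpha, s_beta):
    """Site-dependent overlap q_i = s_i^alpha * s_i^beta of Eq. (2).
    Both arguments are (L, L, L) arrays of ±1 spins from independent
    replicas simulated with the same disorder realization."""
    return s_alpha * s_beta

rng = np.random.default_rng(0)
L = 8
a = rng.choice([-1, 1], size=(L, L, L))   # replica alpha (illustrative)
b = rng.choice([-1, 1], size=(L, L, L))   # replica beta (illustrative)

q = site_overlap(a, b)    # the grid-like "image" fed to the CNN
q_avg = q.mean()          # q = (1/N) sum_i q_i
assert q.shape == (L, L, L) and -1.0 <= q_avg <= 1.0
```

In the ordered phase, q would display the disjoint islands of identical spin configurations described above; for the random configurations in this sketch, q_avg is simply close to zero.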
B. CNN implementation
We use the same number of disorder instances as in Ref. [45], with 100 configurational overlaps at each temperature for each instance. Because the transition temperature with Gaussian disorder is T_c ≈ 0.95 [45–47], following Refs. [6, 8, 48] for the training data, we label the configurational overlaps from temperatures above 0.95 as "1" and those from temperatures below 0.95 as "0."

TABLE II: Parameters for the prediction samples with bimodal disorder. L is the linear size of the system, N_sa is the number of samples, N_sw is the number of Monte Carlo sweeps for each of the replicas of a single sample, T_min and T_max are the lowest and highest temperatures simulated, N_T is the number of temperatures used in the parallel tempering method for each linear system size L, and N_con is the number of configurational overlaps for a given temperature in each instance.

L    N_sa    N_sw      T_min  T_max  N_T  N_con
 8   …       …         1.05   1.25   12   500
10   10000   300000    1.05   1.25   12   500
12    4000   300000    1.05   1.25   12   500
14    4000   1280000   1.05   1.25   12   500
16    4000   1280000   1.05   1.25   12   500
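The binary labeling convention for the training data can be sketched as follows; the threshold T_c ≈ 0.95 is the Gaussian-disorder estimate cited above, and the optional poison_prob argument anticipates the mislabeling experiment of Sec. IV (both values here are illustrative):

```python
import numpy as np

def label_overlaps(temps, t_c=0.95, poison_prob=0.0, rng=None):
    """Binary training labels for configurational overlaps: '1' for
    temperatures above T_c, '0' below. With poison_prob > 0, each
    label is independently flipped with that probability, mimicking
    an (inadvertently) poisoned training set."""
    rng = np.random.default_rng(rng)
    labels = (np.asarray(temps) > t_c).astype(int)
    flip = rng.random(len(labels)) < poison_prob
    return np.where(flip, 1 - labels, labels)

temps = np.linspace(0.80, 1.21, 20)      # the N_T = 20 training temperatures
clean = label_overlaps(temps)            # deterministic, unpoisoned labels
poisoned = label_overlaps(temps, poison_prob=0.01, rng=0)
```

Each label would be attached to all overlap configurations generated at that temperature; with poison_prob = 0.01, roughly one label per hundred samples is wrong on average.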
The parameters for the architecture of the convolutional neural network are listed in Tab. III. We inherit the single-layer structure from Ref. [8]. All parameters are determined using extra validation sample sets, which are also generated from Monte Carlo simulations.
TABLE III: CNN architecture, parameters, and hardware details.

Number of layers         …
Channels in each layer   …
Filter size              … × … × …, stride …
Activation function      ReLU
Optimizer                AdamOptimizer (learning rate 10^-…)
Batch size               …
Iterations               …
Software                 TensorFlow (Python)
Hardware                 Lenovo x86 HPC cluster with a dual-GPU NVIDIA Tesla K80 and 128 GB RAM
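The forward pass of such a single-convolutional-layer classifier can be sketched in plain numpy (the paper's actual implementation is in TensorFlow). Since the layer counts, channel numbers, and filter size in Tab. III did not survive extraction, the hyperparameters below (8 channels, 2×2×2 filters) and the random weights are placeholders:

```python
import numpy as np

def conv3d_valid(x, w):
    """'Valid' 3D convolution of one input volume x with one filter w,
    stride 1 -- the building block of a single CNN layer."""
    f = w.shape[0]
    n = x.shape[0] - f + 1
    out = np.empty((n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i, j, k] = np.sum(x[i:i + f, j:j + f, k:k + f] * w)
    return out

def forward(q, filters, w_out, b_out):
    """Conv layer + ReLU + global average pooling + sigmoid readout:
    maps an (L, L, L) overlap configuration to a probability of being
    in the high-temperature phase (label '1')."""
    feats = [np.maximum(conv3d_valid(q, w), 0.0).mean() for w in filters]
    z = np.dot(w_out, feats) + b_out
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid output probability

rng = np.random.default_rng(1)
L, n_ch, f = 8, 8, 2                     # placeholder hyperparameters
filters = [rng.normal(size=(f, f, f)) for _ in range(n_ch)]
w_out, b_out = rng.normal(size=n_ch), 0.0
q = rng.choice([-1, 1], size=(L, L, L))  # one configurational overlap
p = forward(q, filters, w_out, b_out)
```

With trained (rather than random) weights, p plays the role of the classification probability p(T, L) analyzed in Sec. II C.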
Note that we use between … and … disorder instances for the bimodal prediction data, which is approximately … of the numerical effort needed when estimating the phase transition directly via a finite-size scaling analysis of Monte Carlo data, as done for example in Ref. [45]. As such, pairing high-quality Monte Carlo simulations with machine learning techniques can result in large computational cost savings.

C. Data analysis
Because the configurational overlaps [Eq. (2)] include the information about phases, we expect that different phases have different overlap patterns, similar to grid-like graphs. Therefore, in the region of a specific phase, it is reasonable to expect that the classification probability for the CNN to identify the phase correctly should be larger than 1/2. As such, it can be expected that when the classification probability is 1/2, the system is at the system-size-dependent critical temperature. A thermodynamic estimate can then be obtained via the finite-size scaling method presented below.
Let us define the classification probability as a function of temperature and system size, p(T, L), which can be used as a dimensionless quantity to describe the critical behavior.

FIG. 1: Classification probabilities for different linear system sizes L as a function of temperature T for the prediction of the critical temperature of the bimodal Ising spin glass via a CNN trained with data from a Gaussian distribution. (a) Prediction probability for different system sizes L near the phase transition temperature. The different data sets cross at T_c ≈ 1.1. (b) Measurement of ν_ml by performing a linear fit in a double-logarithmic scale using the extremum points of the derivative of the prediction error with respect to the temperature. (c) Estimate of the critical temperature T_c using the coefficient of the linear term in Eq. (4) (normalized to 1) with L^{1/ν_ml} as the independent variable. The vertical dashed line shows the temperature where the slope vanishes, which corresponds to T_c. (d) Finite-size scaling of the data using the previously-estimated values of ν_ml and T_c. The data collapse onto a universal curve, indicating that the estimates are accurate.

From the scaling hypothesis, we expect p(T, L) to have the following behavior in the vicinity of the critical temperature T_c:

\langle p(T, L) \rangle = \tilde{F}\left[ L^{1/\nu_{ml}} (T - T_c) \right] ,    (3)

where the average is over disorder realizations. Note that the critical exponent ν_ml is different from the one calculated using physical quantities. Due to the limited system sizes that we have studied, finite-size scaling must be used to reliably calculate the critical parameters in the thermodynamic limit. Assuming that we are close enough to the critical temperature T_c, the scaling function F̃ in Eq. (3) can be expanded as a third-order polynomial in x = L^{1/\nu_{ml}} (T - T_c):

\langle p(T, L) \rangle \sim p_0 + p_1 x + p_2 x^2 + p_3 x^3 .    (4)

First, we evaluate ν_ml by noting that, to leading order in x, the derivative of ⟨p(T, L)⟩ in Eq. (4) with respect to temperature has the following form:

\frac{d \langle p(T, L) \rangle}{dT} \sim L^{1/\nu_{ml}} \left[ p_1 + 2 p_2 L^{1/\nu_{ml}} (T - T_c) + 3 p_3 L^{2/\nu_{ml}} (T - T_c)^2 \right] .    (5)
FIG. 2: Classification probabilities for different system sizes L for an Ising spin glass with bimodal disorder. 1% of the labels have been mixed on average. There is no clear sign of the transition.

Therefore, the extremum point of d⟨p(T, L)⟩/dT scales as

\left. \frac{d \langle p(T, L) \rangle}{dT} \right|_{T = T^*} \sim L^{1/\nu_{ml}} .    (6)

A linear fit in a double-logarithmic scale then produces the value of ν_ml (the slope of the straight line), which is subsequently used to estimate T_c. To do so, we turn back to Eq. (4), where we realize that the coefficient of the linear term, with L^{1/ν_ml} as the independent variable, is proportional to (T − T_c), which changes sign at T = T_c. Alternatively, we can vary T_c until the data for all system sizes collapse onto a common third-order polynomial curve. This is true because the scaling function F̃, as a function of L^{1/ν_ml}(T − T_c), is universal. The error bars can be computed using the bootstrap method.

III. RESULTS USING DATA WITHOUT POISONING
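Before turning to the results, the ν_ml and T_c extraction of Sec. II C can be sketched in numpy. The smooth scaling function, the system sizes, and the "true" values ν_ml = 1.5 and T_c = 1.12 below are illustrative stand-ins for actual CNN outputs:

```python
import numpy as np

# Assumed "true" critical parameters used only to generate synthetic data.
nu_ml, T_c = 1.5, 1.12
sizes = np.array([8, 10, 12, 14, 16])
T = np.linspace(1.05, 1.25, 201)

def p_of_T(L):
    # Scaling form <p(T, L)> = F[L^{1/nu}(T - T_c)], Eq. (3), with a smooth F.
    x = L ** (1.0 / nu_ml) * (T - T_c)
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: nu_ml from the extremum of the temperature derivative, Eq. (6):
# max_T d<p>/dT ~ L^{1/nu_ml}, so a log-log fit yields 1/nu_ml as the slope.
peaks = [np.max(np.gradient(p_of_T(L), T)) for L in sizes]
slope, _ = np.polyfit(np.log(sizes), np.log(peaks), 1)
nu_est = 1.0 / slope

# Step 2: T_c from the crossings of p(T, L) with 1/2 (equivalently, the
# sign change of the linear coefficient in the Eq. (4) expansion).
crossings = [np.interp(0.5, p_of_T(L), T) for L in sizes]
T_c_est = np.mean(crossings)
```

On real data, the same crossing/collapse logic applies per bootstrap resample over disorder realizations, which supplies the error bars mentioned in Sec. II C.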
Figure 1 shows results from the CNN trained with well-prepared (thermalized) data from a Gaussian distribution, predicting the phase transition of data from a bimodal disorder distribution. Figure 1(a) shows the prediction probabilities for different linear system sizes L as a function of temperature T. The curves cross the p = 0.5 line in the region of the transition temperature for the bimodal Ising spin glass. Figures 1(b) and 1(c) show the estimates of the exponent ν_ml and the critical temperature T_c, respectively, using the methods developed in Sec. II C. The critical temperature T_c ≈ 1.1 is in good agreement with previous estimates (see, for example, Ref. [45]). Finally, in Fig. 1(d), the data points are plotted as a function of the reduced variable x = L^{1/ν_ml}(T − T_c) using the estimated values of the critical parameters. The universality of the scaling curve underlines the accuracy of the estimates.

IV. RESULTS USING POISONED TRAINING SETS
Although we have shown that the prediction from a convolutional neural network can be precise, we still need to test how poisoned data sets impact the final prediction. First, we randomly mix the classification labels of the training samples with a probability of 1%, i.e., with a training set of 100 samples, this means only one mislabeled sample on average. Then we train the network and use the same samples in the prediction stage. Compared to Fig. 1, Fig. 2 shows no clear sign of a phase transition. This means that mislabeling a very small portion of the training data can strongly affect the outcome. Given the hierarchical structure of CNNs, errors can easily be amplified in propagation [49, 50], which is a possible explanation of the observed behavior.

FIG. 3: Classification probabilities for different system sizes L for an Ising spin glass with bimodal disorder. The Gaussian training data are not thermalized. There is no clear sign of a phase transition.

Finally, we test the effects of poorly prepared training data; in this case, the training data are not properly thermalized. Figure 3 shows the prediction results using data with only 50% of the Monte Carlo sweeps needed for thermalization of the Gaussian training samples. Although 50% might seem extreme at first sight, it is important to emphasize that thermalization times (as well as time-to-solution) are typically distributed according to fat-tailed distributions [51]. In general, users perform at least a factor of … additional thermalization to ensure most instances are in thermal equilibrium. As in the case where the labels were mixed, a transition cannot be clearly identified. This is a strong indication that the training data need to be carefully prepared.
We have also studied the effects of poorly-thermalized prediction data paired with well-thermalized training data (not shown).
In this case, the impacts on the prediction probabilities are small but not negligible.

V. DISCUSSION
We have studied the effects of poisoned data sets when training CNNs to detect phase transitions in physical systems. Our results show that good training sets are a necessary requirement for good predictions. Small perturbations in the training set can lead to misleading results.
We do note, however, that we might not have selected the best parameters for the CNN. Using cross-validation or bootstrapping might allow for a better tuning of the parameters and thus improve the quality of the predictions. Furthermore, due to the large number of predictors, overfitting is possible. This, however, can be alleviated by the introduction of penalty terms. Finally, the use of other activation functions and optimizers can also impact the results. This, together with the sensitivity towards the quality of the training data that we find in this work, suggests that machine learning techniques should be used with caution in physics applications. Garbage in, garbage out . . .
Acknowledgments
We would like to thank Humberto Munoz Bauza and Wenlong Wang for fruitful discussions. This work is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via MIT Lincoln Laboratory Air Force Contract No. FA8721-05-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon. We thank Texas A&M University for access to their Terra cluster.

[1] S. O. Haykin, Neural Networks and Learning Machines (Pearson, 2008).
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[3] C. Bishop, Pattern Recognition and Machine Learning (Springer-Verlag, New York, 2006).
[4] P. Ronhovde, S. Chakrabarty, D. Hu, M. Sahu, K. K. Sahu, K. F. Kelton, N. A. Mauro, and Z. Nussinov, The European Physical Journal E, 105 (2011).
[5] Z. Nussinov, P. Ronhovde, D. Hu, S. Chakrabarty, B. Sun, N. A. Mauro, and K. K. Sahu, in Information Science for Materials Discovery and Design, edited by T. Lookman, F. J. Alexander, and K. Rajan (Springer International Publishing, Cham, 2016), Springer Series in Materials Science, p. 115.
[6] J. Carrasquilla and R. G. Melko, Nature Physics 13, 431 (2017).
[7] K. Ch'ng, J. Carrasquilla, R. G. Melko, and E. Khatami, Phys. Rev. X 7, 031038 (2017).
[8] A. Tanaka and A. Tomiya, J. Phys. Soc. Jpn. 86, 063001 (2017).
[9] K. Kashiwa, Y. Kikuchi, and A. Tomiya (2018), (arXiv:cond-mat/1812.01522).
[10] F. Santosa and W. W. Symes, SIAM J. Sci. Stat. Comput., 1307 (1986).
[11] R. Tibshirani, Journal of the Royal Statistical Society, Series B, 267 (1994).
[12] G. Mateos, J. A. Bazerque, and G. B. Giannakis, Trans. Sig. Proc., 5262 (2010).
[13] J. Quinonero Candela and C. Rasmussen, Journal of Machine Learning Research, 1935 (2005).
[14] L. Rokach and O. Maimon, Data Mining With Decision Trees: Theory and Applications (World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2014), 2nd ed.
[15] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, New York, NY, USA, 2014).
[16] D. Mehta and V. Raghavan, Theor. Comput. Sci., 609 (2002).
[17] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R (Springer Press, 2013).
[18] C. Hsu, C. Chang, and C. Lin, A practical guide to support vector classification (2010).
[19] J. C. Platt (MIT Press, Cambridge, MA, USA, 1999), chap. Fast Training of Support Vector Machines Using Sequential Minimal Optimization, p. 185.
[20] A. Widodo and B.-S. Yang, Mechanical Systems and Signal Processing, 2560 (2007).
[21] T. Joachims, in Proceedings of the 10th European Conference on Machine Learning (Springer-Verlag, Berlin, Heidelberg, 1998), p. 137.
[22] Y. LeCun and Y. Bengio (MIT Press, Cambridge, MA, USA, 1998), chap. Convolutional Networks for Images, Speech, and Time Series, p. 255.
[23] W. Zhang, K. Itoh, J. Tanida, and Y. Ichioka, Appl. Opt. 29, 4790 (1990).
[24] H. Munoz-Bauza, F. Hamze, and H. G. Katzgraber (2019), (arXiv:cond-mat/1903.06993).
[25] R. Alvarez Baños, A. Cruz, L. A. Fernandez, J. M. Gil-Narvion, A. Gordillo-Guerrero, M. Guidetti, A. Maiorano, F. Mantovani, E. Marinari, V. Martin-Mayor, et al., J. Stat. Mech. P06026 (2010).
[26] R. A. Baños, A. Cruz, L. A. Fernandez, J. M. Gil-Narvion, A. Gordillo-Guerrero, M. Guidetti, D. Iñiguez, A. Maiorano, E. Marinari, V. Martin-Mayor, et al., Proc. Natl. Acad. Sci. U.S.A., 6452 (2012).
[27] M. Baity-Jesi, R. Alvarez Baños, A. Cruz, L. A. Fernandez, J. M. Gil-Narvion, A. Gordillo-Guerrero, D. Iñiguez, A. Maiorano, F. Mantovani, E. Marinari, et al., Phys. Rev. E, 032140 (2014).
[28] J. Steinhardt, P. W. Koh, and P. Liang, arXiv e-prints (2017), (arXiv:cs/1706.03691).
[29] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li (2018), (arXiv:cs/1804.00308).
[30] R. Alfeld, X. Zhu, and P. Barford, AAAI, p. 1452 (2016).
[31] Y. Shi, T. Erpek, Y. E. Sagduyu, and J. H. Li (2019), (arXiv:cs/1901.09247).
[32] S. Jiang, S. Lu, and D.-L. Deng (2019), (arXiv:cond-mat/1910.13453).
[33] B. Nelson, M. Barreno, F. J. Chi, A. D. Joseph, B. I. Rubinstein, U. Saini, C. Sutton, J. D. Tygar, and K. Xia, in Proc. First USENIX Workshop on Large-Scale Exploits and Emergent Threats, LEET (2008).
[34] A. Newell, L. Potharaju, L. Xiang, and C. Nita-Rotaru, in Proc. Workshop on Artificial Intelligence and Security, AISec (2014).
[35] S. F. Edwards and P. W. Anderson, J. Phys. F: Met. Phys. 5, 965 (1975).
[36] K. Binder and A. P. Young, Rev. Mod. Phys. 58, 801 (1986).
[37] M. Mézard, G. Parisi, and M. A. Virasoro, Spin Glass Theory and Beyond (World Scientific, Singapore, 1987).
[38] A. P. Young, ed., Spin Glasses and Random Fields (World Scientific, Singapore, 1998).
[39] D. L. Stein and C. M. Newman, Spin Glasses and Complexity, Primers in Complex Systems (Princeton University Press, Princeton, NJ, 2013).
[40] M. Abadi et al., TensorFlow: A System for Large-Scale Machine Learning (2016), http://tensorflow.org.
[41] D. Sherrington and S. Kirkpatrick, Phys. Rev. Lett. 35, 1792 (1975).
[42] G. Parisi, J. Phys. A 13, 1101 (1980).
[43] G. Parisi, Phys. Rev. Lett. 50, 1946 (1983).
[44] K. Hukushima and K. Nemoto, J. Phys. Soc. Jpn. 65, 1604 (1996).
[45] H. G. Katzgraber, M. Körner, and A. P. Young, Phys. Rev. B 73, 224432 (2006).
[46] E. Marinari, G. Parisi, and J. J. Ruiz-Lorenzo, Phys. Rev. B 58, 14852 (1998).
[47] H. G. Katzgraber and I. A. Campbell, Phys. Rev. B, 014462 (2005).
[48] J. Carrasquilla, K. Ch'ng, R. G. Melko, and E. Khatami, Phys. Rev. X 7, 031038 (2017).
[49] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (MIT Press, Cambridge, MA, USA, 1988), chap. Learning Representations by Back-propagating Errors, p. 696.
[50] R. Hecht-Nielsen (Harcourt Brace & Co., Orlando, FL, USA, 1992), chap. Theory of the Backpropagation Neural Network, p. 65.
[51] D. S. Steiger, T. F. Rønnow, and M. Troyer, Phys. Rev. Lett. 115, 230501 (2015).