Machine learning spectral indicators of topology

Nina Andrejevic,∗ Jovana Andrejevic,∗ Chris H. Rycroft, and Mingda Li

March 19, 2020

Keywords: machine learning, topological materials, X-ray absorption spectroscopy
Topological materials discovery has emerged as an important frontier in condensed matter physics. Recent theoretical approaches based on symmetry indicators and topological quantum chemistry have been used to identify thousands of candidate topological materials, yet experimental determination of materials' topology often poses significant technical challenges. X-ray absorption spectroscopy (XAS) is a widely used materials characterization technique sensitive to atoms' local symmetry and chemical environment; thus, it may encode signatures of materials' topology, though indirectly. In this work, we show that XAS can potentially uncover materials' topology when augmented by machine learning. By labelling computed X-ray absorption near-edge structure (XANES) spectra of over 16,000 inorganic materials with their topological class, we establish a machine learning-based classifier of topology with XANES spectral inputs. Our classifier correctly predicts 81% of topological and 80% of trivial cases, and can achieve 90% and higher accuracy for materials containing certain elements. Given the simplicity of the XAS setup and its compatibility with multimodal sample environments, the proposed machine learning-empowered XAS topological indicator has the potential to discover broader categories of topological materials, such as non-cleavable compounds and amorphous materials. It can also inform a variety of field-driven phenomena in situ, such as magnetic field-driven topological phase transitions.

Topological materials are defined by the nontrivial topology of their electronic band structures, from which they derive their robust and unconventional properties. The allure of developing these exotic phases into useful applications has garnered widespread efforts to identify and catalogue candidate topological materials to accelerate experimental discovery and synthesis.
Recent theoretical progress in topological materials classification based on indicators such as chemistry and crystal symmetry has led to the prediction of over 8,000 topologically non-trivial phases, a vast unexplored territory for experiments. This is strong motivation to develop complementary experimental techniques for high-throughput screening of candidate materials. Current state-of-the-art techniques such as angle-resolved photoemission spectroscopy (ARPES), scanning tunneling microscopy (STM), and quantum transport measurements are commonly used to detect certain topological signatures, but a few limitations remain. Methods that directly probe band topology typically place strict requirements on sample preparation and the sample environment, limiting the range of experimentally accessible candidates. Other, more indirect methods can be performed over a broader sample space or without significant technical barriers, but the topological character often needs to be inferred with substantial analysis. Neither approach yet fully meets the demands of a high-throughput classification task.

Machine learning methods are increasingly being adapted to materials research, from materials discovery and property prediction, including topology inferred from structural and compositional attributes or from quantum theoretical and simulated data, to processing and analysis of complex experimental signatures. This presents a potential opportunity to empower, through machine learning, high-throughput experimental techniques which may only indirectly probe topological character. X-ray absorption spectroscopy (XAS) is widely used to discriminate stoichiometrically similar compounds based on differences in the local chemical environment of their constituent atoms, including their coordination, bond angles, and spatial symmetry.
Thus, although indirect, XAS signatures are a potentially useful encoding of topological character which may be deciphered through machine learning methods to diagnose materials' topology.

In this work, we have labelled the database of computed X-ray absorption near-edge structure (XANES) spectra according to topological class, based on the catalogue of high-quality topological materials predicted by the topological quantum chemistry formalism, and optimized a convolutional neural network architecture to predict the topological class from XANES spectral inputs. We find that our classifier correctly predicts an overall 81% of topological cases and 80% of trivial cases, and we discuss its comparative success in terms of precision, recall, and F1 metrics. In particular, for materials containing common elements, including Be, B, Si, Ca, Ti, Zn, Ga, and Ta, approximately 90% recall for topological materials is achieved overall. Our work suggests the potential of machine learning to uncover topological character embedded in complex spectral features, even though a mechanistic understanding is challenging to acquire.

Data acquisition and assembly
The materials data used for this study were curated from the Inorganic Crystal Structure Database (ICSD) and labelled according to their classification in the database of topological materials based on the topological quantum chemistry formalism. XAS data were obtained from the database of computed K-edge XANES spectra distributed on the Materials Project. The materials data were refined based on availability of both high-quality topological classification and spectral data, resulting in 16,458 total materials considered: 3,121 topological, 13,337 trivial. Details of this procedure are described in the Methods section, and schematic illustrations of the data refinement procedure and resulting input data structure are given in Fig. 1(a-b). Additionally, the representation of different elements among topological and trivial examples is shown in Fig. S1(a-b).

Affiliations: Quantum Matter Group, MIT. Department of Materials Science and Engineering, MIT. John A. Paulson School of Engineering and Applied Sciences, Harvard University. Computational Research Division, Lawrence Berkeley Laboratory. Department of Nuclear Science and Engineering, MIT. ∗ These authors contributed equally to this work. † [email protected]. †† [email protected].

Fig. 1 Data assembly and model architecture. (a) Candidate samples considered in this study were extracted from the ICSD database according to a high-quality filtering procedure (see Methods). The subset for which computed XANES spectra were available on the Materials Project database were used in this study, a portion of which are predicted as topological. (b) A total of 81 absorbing elements were represented in the assembled data across 16,458 total samples. The XANES spectrum for each sample was subdivided into 81 distinct energy ranges corresponding to each element's absorption range. (c) A representative, complete XANES spectrum (top) with windows around each absorption edge (bottom) enlarged for clarity. Upon standardization (see Methods for details), the spectra of each absorbing element present are input directly into the appropriate element channel of the machine learning model. (d) Schematic of the custom neural network architecture with both spectral and symmetry group inputs. A separate pipeline of convolutional and max pooling layers processes XANES spectra, while a fully-connected layer processes symmetry group information. The two outputs are combined to produce the final prediction.

From the assembled dataset of 16,458 samples, training was performed on 70% of the data, reserving 10% for validation and 20% for testing. While samples were randomly distributed among the data subsets, an assignment process was developed to ensure balanced representation of each absorbing element within each subset, as detailed in the Methods section. The balanced absorbing element statistics across the three subsets achieved by our assignment process are shown in Fig. S1(c). Prior to training, we further augmented the data by shifting each spectrum ∼ ±
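The augmentation step above shifts each spectrum slightly along the energy axis; the exact shift magnitude does not survive in this copy of the text, so the sketch below assumes a small random shift of at most one grid point on the 200-point re-sampled energy grid. Function names are illustrative, not from the paper's code.

```python
import numpy as np

def shift_spectrum(intensity: np.ndarray, shift: int) -> np.ndarray:
    """Shift a re-sampled spectrum by `shift` grid points along the energy
    axis, padding with the edge value so the length is preserved."""
    if shift == 0:
        return intensity.copy()
    out = np.roll(intensity, shift)
    if shift > 0:
        out[:shift] = intensity[0]   # pad leading edge
    else:
        out[shift:] = intensity[-1]  # pad trailing edge
    return out

def augment(spectra: np.ndarray, rng: np.random.Generator,
            max_shift: int = 1) -> np.ndarray:
    """Return one randomly shifted copy of each (n_samples, n_energies)
    spectrum; max_shift is an assumed value, not the paper's."""
    shifts = rng.integers(-max_shift, max_shift + 1, size=len(spectra))
    return np.stack([shift_spectrum(s, k) for s, k in zip(spectra, shifts)])
```

In practice each training spectrum would be duplicated with a few such shifted copies, leaving validation and test spectra untouched.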
Network architecture optimization
We posed our machine learning task as a binary classification of topological character: topological or trivial. A further distinction between semimetals and insulators was not pursued, as they are easily differentiated using conventional methods. A typical set of XANES spectra supplied as input to the machine learning model is shown in Fig. 1(c). Details regarding pre-processing of the input data are provided in the Methods section. Across all samples considered in this work, a total of 81 different elements were represented. A convolutional core network architecture was constructed as shown schematically in Fig. S4(a). Convolutional neural networks are a common approach to learning from translationally invariant features such as peaks in a signal; however, they scale quickly in the number of model parameters, resulting in a complex model prone to overfitting. We employed regularization, dropout, and a systematic tuning of hyperparameters to remedy this issue. Additionally, due to the high class imbalance (approximately one topological example for every four trivial examples), as well as an imbalance in representation among different elements, samples were weighted differently to add greater penalty to the misclassification of underrepresented cases. The network weights were optimized on the training data until no substantial improvement in the validation loss could be observed, as shown in Fig. S5(d), and all subsequent evaluations were performed on the test set. Additional details on the network implementation and sample weights are available in the Methods section.

Optimization of this core network architecture led to accurate prediction of approximately 78% of topological and 77% of trivial samples (see Fig. S4(b)). About a 3% improvement over this result was achieved by introducing symmetry group as an additional input descriptor, motivated by the recent use of symmetry indicators for theoretical prediction of topological materials. This was incorporated in the final network architecture presented in this work and shown schematically in Fig. 1(d). A sample's symmetry group information was represented as three one-hot vectors encoding the crystal system, point group, and space group. The mixed input of spectral and symmetry group data was accommodated by supplying two separate input channels into initially independent sub-networks: the first sub-network has the core structure described above, while the second independently processes the symmetry group encoding through a single fully-connected dense layer. The outputs of the two separate processing channels converge on a shared segment of the network culminating in the final prediction. We note that the neural network architecture showed mildly better performance than a traditional support vector machine (SVM), likewise trained using combined spectral and symmetry group inputs for comparison (see Fig. S3).

Fig. 2 Exploratory analysis using principal components and k-means clustering. (a) Decision boundary visualizations of classifications by unsupervised k-means clustering for selected elements along the first two principal components. The clustering performance exhibited three signature patterns: primary segregation of topological examples (first row), balanced segregation of topological and trivial examples (second row), and no apparent clustering by class (third row). (b) Representative XANES spectra of selected elements colored by topological class. Inset shows the corresponding decision boundary visualizations from (a). Fraction of correctly classified (c) topological and (d) trivial examples by element.

Results
Exploratory analysis
Prior to training the neural network classifier, we conducted an exploratory analysis of the assembled XANES spectra to gauge the separability by topological class exhibited by different elements. For all examples containing a given element, we performed a principal component analysis (PCA) on the pre-processed high-dimensional spectra and subsequently carried out unsupervised k-means clustering on the first two principal components of the training set. Results of the clustering analysis for a selection of elements are shown in Fig. 2(a), with corresponding original spectra shown for six example elements in Fig. 2(b). The decision boundary between the two clusters identified by k-means clustering lies at the intersection of the blue (trivial) and orange (topological) shaded regions. Since unsupervised clustering is blind to the true topological class of the examples, cluster assignment was performed by solving an optimal matching problem which finds the pairing between clusters and topological classes that minimizes the number of misclassified examples, corrected for class imbalance. The examples from all three datasets (training, validation, and testing) are plotted as scattered points in the low-dimensional space and colored according to their known topological class. Additional visualizations are shown in Fig. S2(a). The fraction of scattered points from the testing set (unseen during cluster assignment) which were clustered consistently with their topological class is given for each element in Fig. 2(c) and Fig. 2(d) for topological and trivial examples, respectively. A quick survey of these results reveals a number of elements for which the classification accuracy of topological and trivial examples exhibits a clear trade-off, and a few for which the classification accuracy is more balanced. We correlated these observations with the decision boundary visualizations and noted three distinct patterns in the results of our unsupervised clustering. For some elements, nearly all topological examples were segregated within a single cluster (first row of Fig. 2(a)). This led to a strong score for topological examples but a weaker score for trivial ones for elements like As, Ge, and Si. Other elements like Mg, Mo, and Ti exhibited more balanced classification accuracies between the two topological classes (second row of Fig. 2(a)). On the other hand, there were a number of unsuccessful clustering examples, such as Ba, F, and K, for which clustering of the data did not appear coincident with topological class (third row of Fig. 2(a)). Given that the feature transformations performed in our exploratory analysis were purely linear, its ability to discriminate data between the two classes for certain elements was already encouraging. It further suggested a possible advantage to using more complex, non-linear methods, such as those enabled by a neural network classifier, to improve prediction accuracy for both classes.

Fig. 3 Convolutional neural network (CNN) classifier performance on topological examples with combined spectral and symmetry group input. (a) The recall of the topological class, computed independently for the subsets of test data containing each absorbing element. Each element's entry lists its atomic number, atomic symbol, and recall, and is colored according to the recall. Due to low overall count, topological samples containing Tc and Pa were absent from the testing set; thus their entries are marked as zero. The corresponding (b) precision and (c) F1 score for each absorbing element's subsamples. (d) Two-dimensional projection of the decision boundary, with the classification threshold set at 0.5. Testing data are projected along the first two principal components of their learned low-dimensional representation (see main text for details), and plotted as scattered points colored according to true topological class. (e) Comparative plots of the overall recall, precision, and F1 scores for topological examples (left panel) and trivial examples (right panel) achieved using each of the methods described in the text: k-means clustering (k-means); support vector machine trained on combined spectral and symmetry group inputs (SVM-sg); convolutional neural network with only spectral input (CNN); and convolutional neural network with combined spectral and symmetry group inputs (CNN-sg).
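The exploratory pipeline above (PCA down to two components, two-cluster k-means, then a cluster-to-class assignment corrected for class imbalance) can be sketched in plain NumPy. This is a minimal illustration under our own assumptions: the function names are ours, and since the paper's exact imbalance correction is not spelled out here, the sketch picks the cluster labeling that maximizes the mean of the per-class accuracies.

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def kmeans_2(X: np.ndarray, n_iter: int = 100, seed: int = 0) -> np.ndarray:
    """Plain Lloyd's algorithm with k = 2 clusters; returns 0/1 labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        new = []
        for k in (0, 1):
            pts = X[labels == k]
            new.append(pts.mean(axis=0) if len(pts) else centers[k])
        new = np.array(new)
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def match_clusters(labels: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Assign clusters to classes so as to maximize the mean of the two
    per-class accuracies (a simple class-imbalance correction)."""
    def balanced_acc(pred):
        return np.mean([(pred[y == c] == c).mean() for c in (0, 1)])
    flipped = 1 - labels
    return labels if balanced_acc(labels) >= balanced_acc(flipped) else flipped
```

For a matrix `X` of pre-processed spectra for one absorbing element and 0/1 class labels `y`, the per-element clustering prediction is then `match_clusters(kmeans_2(pca_2d(X)), y)`.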
Machine learning model performance
We now proceed to a detailed analysis of our highest-performing model, the custom neural network architecture with symmetry group input depicted in Fig. 1(d). The strong class imbalance makes simple accuracy a less reliable quantification of model performance; thus, we use three different metrics in assessing the quality of prediction: recall, precision, and F1 score. Let tp and tn denote the number of true positive and true negative predictions, and fp and fn denote the number of false positive and false negative predictions, respectively. The metrics are then defined as

recall: r = tp / (tp + fn)    (1a)
precision: p = tp / (tp + fp)    (1b)
F1 score: F1 = 2 p r / (p + r)    (1c)

These scores may be computed independently for different subsets of the testing data; thus, we compute separate scores for the topological and trivial classes achieved by each of the four methods discussed in this work: k-means clustering; SVM trained on combined spectral and symmetry group inputs; convolutional neural network with only spectral input; and convolutional neural network with combined spectral and symmetry group inputs. These results are presented in Fig. 3(e), revealing highly comparable performance among the supervised methods, with a slight improvement in all three scores (between 1% and 5%) achieved by including symmetry group information. We further disaggregate these metric scores by absorbing element, as shown in Fig. 3(a-c) for topological examples and Fig. S5(a-c) for trivial examples, achieved using the final neural network classifier. Disaggregated scores for the SVM with combined spectral and symmetry group inputs (Fig. S3(c-h)) and the neural network classifier with only spectral inputs (Fig. S4(c-h)) are also computed for comparison. The recall is 81% for the topological and 80% for the trivial class overall. The trivial class predictions also demonstrate a high overall precision of 95%; few topological samples are mistaken for trivial ones. Conversely, a larger fraction of trivial samples are misclassified as topological, resulting in a relatively low precision (49%) for the topological class. While this result appears less favorable, the high false positive rate may also suggest that indicators beyond symmetry-based signatures may be needed to diagnose additional classes of topological materials. The F1 score is valuable as a composite quantifier of prediction quality, but we bear in mind the effect of precision in the comparatively lower F1 score of the topological class predictions. We further visualize the learned representation of the samples produced by our neural network classifier by probing the 8-dimensional encoded space preceding the final classification layer. For visualization purposes, we plot the projection of each test sample's representation in this 8-dimensional space onto its first two principal components, displayed in Fig. 3(d).

Discussion
We compared the effectiveness of the exploratory PCA with unsupervised clustering and that of the final neural network classifier. It is important to acknowledge the differences in approach of the two methods beyond the model complexity. In particular, the PCA was conducted separately for subsets of samples containing a particular element, and the unsupervised clustering results independently suggested that a certain element was either an effective or ineffective predictor of topology based on the clustering accuracy. By contrast, in the neural network classifier, all elements were considered simultaneously, so that a sample's classification was based on information from the spectra of all its absorbing elements. This holistic approach does not easily decouple the predictive ability of elements from one another, but we do see some trends maintained across both approaches. A number of lighter transition metals tend to have balanced success in both the PCA and neural network approaches, while the halogens continue to be relatively poor predictors of topology. The elements from the boron and carbon families scored well for both topological and trivial samples in the neural network approach, but not in unsupervised clustering. Interestingly, elements like Be and Ca exhibited some of the highest scores from the neural network classification, which were furthermore equitable among topological and trivial examples (compare Fig. 3(a) and Fig. S5(a)). This occurred despite the fact that the fraction of topological examples containing either Be or Ca was not necessarily high (see Fig. S1(c)) compared to elements like Ce, Ir, and Ni, which had among the highest fractional abundance of topological examples, and for which the fraction of correctly classified topological examples tended to significantly outweigh that of trivial examples. Moreover, unsupervised clustering was comparatively unsuccessful for these two elements, tending to heavily favor topological examples. We do note that the success of the neural network classifier can be attributed significantly to the presence of particular elements; further work is being pursued to more accurately decouple this contribution from that of more subtle variations in the XAS spectral features for a given absorbing element.
Conclusion
We explored the predictive power of XAS as a potential discriminant of topological character by training and evaluating a convolutional neural network classifier on more than 16,000 examples of computed XANES spectra labelled according to one of the largest catalogues of topological materials. A number of important extensions are envisioned for this work, such as its application to experimental XANES data, incorporation of a multi-fidelity approach to favor experimentally validated examples, expansion of the energy range to the extended X-ray absorption fine structure (EXAFS) regime, and inquiry into the detailed contribution from spectral features for individual elements. Our results demonstrate a promising pathway to develop robust experimental protocols for high-throughput screening of candidate topological materials aided by machine learning methods. Additionally, the flexibility of the XAS sample environment can further enable the study of materials whose topological phases emerge when driven by electric, magnetic, or strain fields, and even presents the opportunity to study topology with strong disorder and topology in amorphous materials. Thus, machine learning-empowered XAS may be poised to become a simple but powerful experimental tool for topological classification.

References
1. Nature Physics, vol. 15, no. 5, pp. 470–476, 2019.
2. F. Tang, H. C. Po, A. Vishwanath, and X. Wan, "Comprehensive search for topological materials using symmetry indicators," Nature, vol. 566, no. 7745, pp. 486–489, 2019.
3. H. C. Po, A. Vishwanath, and H. Watanabe, "Symmetry-based indicators of band topology in the 230 space groups," Nature Communications, vol. 8, no. 1, pp. 1–9, 2017.
4. J. Kruthoff, J. de Boer, J. van Wezel, C. L. Kane, and R.-J. Slager, "Topological classification of crystalline insulators through band structure combinatorics," Phys. Rev. X, vol. 7, p. 041069, 2017.
5. R.-J. Slager, A. Mesaros, V. Juričić, and J. Zaanen, "The space group classification of topological band-insulators," Nature Physics, vol. 9, no. 2, pp. 98–102, 2013.
6. B. Bradlyn, L. Elcoro, J. Cano, M. Vergniory, Z. Wang, C. Felser, M. Aroyo, and B. A. Bernevig, "Topological quantum chemistry," Nature, vol. 547, no. 7663, pp. 298–305, 2017.
7. M. Vergniory, L. Elcoro, C. Felser, N. Regnault, B. A. Bernevig, and Z. Wang, "A complete catalogue of high-quality topological materials," Nature, vol. 566, no. 7745, pp. 480–485, 2019.
8. K. Mathew, C. Zheng, D. Winston, C. Chen, A. Dozier, J. J. Rehr, S. P. Ong, and K. A. Persson, "High-throughput computational X-ray absorption spectroscopy," Scientific Data, vol. 5, p. 180151, 2018.
9. M. Z. Hasan and C. L. Kane, "Colloquium: Topological insulators," Rev. Mod. Phys., vol. 82, pp. 3045–3067, 2010.
10. X.-L. Qi and S.-C. Zhang, "Topological insulators and superconductors," Rev. Mod. Phys., vol. 83, pp. 1057–1110, 2011.
11. A. Bansil, H. Lin, and T. Das, "Colloquium: Topological band theory," Rev. Mod. Phys., vol. 88, p. 021004, 2016.
12. N. P. Armitage, E. J. Mele, and A. Vishwanath, "Weyl and Dirac semimetals in three-dimensional solids," Rev. Mod. Phys., vol. 90, p. 015001, 2018.
13. O. Vafek and A. Vishwanath, "Dirac fermions in solids: from high-Tc cuprates and graphene to topological insulators and Weyl semimetals," Annu. Rev. Condens. Matter Phys., vol. 5, no. 1, pp. 83–112, 2014.
14. R. Chen, H. C. Po, J. B. Neaton, and A. Vishwanath, "Topological materials discovery using electron filling constraints," Nature Physics, vol. 14, no. 1, pp. 55–61, 2018.
15. T. Zhang, Y. Jiang, Z. Song, H. Huang, Y. He, Z. Fang, H. Weng, and C. Fang, "Catalogue of topological electronic materials," Nature, vol. 566, no. 7745, pp. 475–479, 2019.
16. K. Choudhary, K. F. Garrity, and F. Tavazza, "High-throughput discovery of topologically non-trivial materials using spin-orbit spillage," Scientific Reports, vol. 9, no. 1, pp. 1–8, 2019.
17. A. Damascelli, Z. Hussain, and Z.-X. Shen, "Angle-resolved photoemission studies of the cuprate superconductors," Rev. Mod. Phys., vol. 75, pp. 473–541, 2003.
18. S. Suga and A. Sekiyama, Photoelectron Spectroscopy: Bulk and Surface Electronic Structures, vol. 176. Springer, 2013.
19. P. Raccuglia, K. C. Elbert, P. D. Adler, C. Falk, M. B. Wenny, A. Mollo, M. Zeller, S. A. Friedler, J. Schrier, and A. J. Norquist, "Machine-learning-assisted materials discovery using failed experiments," Nature, vol. 533, no. 7601, pp. 73–76, 2016.
20. G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, and R. Ramprasad, "Accelerating materials property predictions using machine learning," Scientific Reports, vol. 3, no. 1, pp. 1–6, 2013.
21. L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, "A general-purpose machine learning framework for predicting properties of inorganic materials," npj Computational Materials, vol. 2, p. 16028, 2016.
22. J. Carrete, W. Li, N. Mingo, S. Wang, and S. Curtarolo, "Finding unprecedentedly low-thermal-conductivity half-Heusler semiconductors via high-throughput materials modeling," Physical Review X, vol. 4, no. 1, p. 011019, 2014.
23. N. Claussen, B. A. Bernevig, and N. Regnault, "Detection of topological materials with machine learning," arXiv preprint arXiv:1910.10161, 2019.
24. J. F. Rodriguez-Nieva and M. S. Scheurer, "Identifying topological order through unsupervised machine learning," Nature Physics, vol. 15, no. 8, pp. 790–795, 2019.
25. Y. Zhang and E.-A. Kim, "Quantum loop topography for machine learning," Phys. Rev. Lett., vol. 118, p. 216401, 2017.
26. W. Lian, S.-T. Wang, S. Lu, Y. Huang, F. Wang, X. Yuan, W. Zhang, X. Ouyang, X. Wang, X. Huang, L. He, X. Chang, D.-L. Deng, and L. Duan, "Machine learning topological phases with a solid-state quantum simulator," Phys. Rev. Lett., vol. 122, p. 210503, 2019.
27. M. S. Scheurer and R.-J. Slager, "Unsupervised machine learning and band topology," arXiv preprint arXiv:2001.01711, 2020.
28. P. Zhang, H. Shen, and H. Zhai, "Machine learning topological invariants with neural networks," Phys. Rev. Lett., vol. 120, no. 6, p. 066401, 2018.
29. G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, "Machine learning and the physical sciences," Rev. Mod. Phys., vol. 91, p. 045002, 2019.
30. M. R. Carbone, S. Yoo, M. Topsakal, and D. Lu, "Classification of local chemical environments from X-ray absorption spectra using supervised machine learning," Phys. Rev. Materials, vol. 3, p. 033604, 2019.
31. A. Cui, K. Jiang, M. Jiang, L. Shang, L. Zhu, Z. Hu, G. Xu, and J. Chu, "Decoding phases of matter by machine-learning Raman spectroscopy," Phys. Rev. Applied, vol. 12, p. 054049, 2019.
32. B. Han, Y. Lin, Y. Yang, N. Mao, W. Li, H. Wang, V. Fatemi, L. Zhou, J. I.-J. Wang, Q. Ma, et al., "Deep learning enabled fast optical characterization of two-dimensional materials," arXiv preprint arXiv:1906.11220, 2019.
33. A. M. Samarakoon, K. Barros, Y. W. Li, M. Eisenbach, Q. Zhang, F. Ye, Z. Dun, H. Zhou, S. A. Grigera, C. D. Batista, et al., "Machine learning assisted insight to spin ice Dy2Ti2O7," arXiv preprint arXiv:1906.11275, 2019.
34. Y. Zhang, A. Mesaros, K. Fujita, S. Edkins, M. Hamidian, K. Chng, H. Eisaki, S. Uchida, J. S. Davis, E. Khatami, et al., "Machine learning in electronic-quantum-matter imaging experiments," Nature, vol. 570, no. 7762, pp. 484–490, 2019.
35. B. S. Rem, N. Käming, M. Tarnowski, L. Asteria, N. Fläschner, C. Becker, K. Sengstock, and C. Weitenberg, "Identifying quantum phase transitions using artificial neural networks on experimental data," Nature Physics, vol. 15, no. 9, pp. 917–920, 2019.
36. G. Bergerhoff and I. Brown, "Crystallographic databases. F. H. Allen et al. (Hrsg.) Chester," International Union of Crystallography, 1987.
37. A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, "The Materials Project: A materials genome approach to accelerating materials innovation," APL Materials, vol. 1, no. 1, p. 011002, 2013.
38. C. Zheng, K. Mathew, C. Chen, Y. Chen, H. Tang, A. Dozier, J. J. Kas, F. D. Vila, J. J. Rehr, L. F. J. Piper, K. A. Persson, and S. P. Ong, "Automated generation and ensemble-learned matching of X-ray absorption spectra," npj Computational Materials, vol. 4, p. 12, 2018.
39. Computational Materials Science, vol. 68, pp. 314–319, 2013.
40. S. P. Ong, S. Cholia, A. Jain, M. Brafman, D. Gunter, G. Ceder, and K. A. Persson, "The Materials Application Programming Interface (API): A simple, flexible and efficient API for materials data based on REpresentational State Transfer (REST) principles," Computational Materials Science, vol. 97, pp. 209–215, 2015.
41. X. Meng and G. E. Karniadakis, "A composite neural network that learns from multi-fidelity data: Application to function approximation and inverse PDE problems," Journal of Computational Physics, vol. 401, p. 109020, 2020.
42. A. Agarwala and V. B. Shenoy, "Topological insulators in amorphous systems," Phys. Rev. Lett., vol. 118, p. 236402, 2017.
43. E. Prodan, "Disordered topological insulators: a non-commutative geometry perspective," Journal of Physics A: Mathematical and Theoretical, vol. 44, no. 11, p. 113001, 2011.

Methods

Data acquisition and assembly
To assemble the training and evaluation datasets, the entries of 68,347 materials in the Inorganic Crystal Structure Database (ICSD) were gathered subject to a high-quality filtering procedure based on that outlined in the topological materials database used for this study. Specifically, only ICSD entries marked as "high quality" that included structural refinement, temperature factors, pressure in the range of 0 . . . ,354 materials in this dataset are classified as topological by the topological quantum chemistry formalism. The database of computed K-edge XANES spectra distributed on the Materials Project was used to gather XAS data, which are calculated using the Green's function formulation of multiple scattering theory implemented in FEFF. Since the Materials Project is a dynamic and evolving platform, we note that our Materials Project queries were performed on the most recent release at the time of this study, v2019.05. All samples with unique entries available on the Materials Project database were thus identified, resulting in 24,538 examples, of which 4,434 were classified as topological and 20,104 as trivial. Finally, we retained only those samples for which XANES spectra were available for every element of the compound, giving us the final set of 16,458: 3,121 topological, 13,337 trivial.
Data preparation and pre-processing
To ensure balanced representation of each absorbing element within each data subset (training, validation, and testing), we carried out the following assignment procedure when constructing the three data subsets. We traversed all 81 absorbing elements in order of increasing abundance in topological samples (see Fig. S1(a)). For the first element considered, all samples containing that element were disaggregated by class and randomly shuffled within each class subset. The desired fractions of testing and validation examples (20% and 10%, respectively) were then allocated from each class, reserving the rest (70%) for training. This process was repeated for subsequent elements while excluding any samples already assigned in a previous step. The balanced representation of topological examples for each element achieved by this process is confirmed in Fig. S1(c).

For each sample, the computed XANES spectra of each absorbing element were interpolated and re-sampled at 200 evenly-spaced energy values. The complete spectrum for a single sample was stored as a two-dimensional array of 81 columns, with non-zero values in the element channels of that sample's constituent atoms. Each of the 81 element channels was independently standardized prior to being input into the machine learning algorithm, since the K-edges of different elements are found over a broad range of energy scales. Standardization entailed centering by the mean of spectral intensity averages over each energy range, and scaling by the mean of intensity standard deviations.

The final model presented in this work additionally introduced symmetry group information as an input. As noted in the main text, the symmetry group of a given sample was encoded as three one-hot vectors: the first is a vector of length 7 with a single non-zero indicator of the crystal system, the second a vector of length 32 indicating the point group, and the third a vector of length 230 indicating the space group.

During training, a different weight, or penalty for misclassification, was computed for each training sample to correct for imbalanced representation in topological class and composition. The sample weight of a compound was defined as inversely proportional to the product of its topological class abundance and the abundance of its least occurring constituent element. All sample weights were then uniformly normalized to achieve a mean of 1.
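A minimal NumPy sketch of these pre-processing steps is given below. This is an illustrative reconstruction, not the authors' code: the function names, the 0-based symmetry-group indices, and the small `eps` guard are our own assumptions.

```python
import numpy as np

N_ENERGY, N_ELEM = 200, 81  # re-sampled energy grid x element channels

def resample_spectrum(energies, intensities, n=N_ENERGY):
    """Interpolate a raw XANES spectrum onto n evenly-spaced energy values."""
    grid = np.linspace(energies.min(), energies.max(), n)
    return np.interp(grid, energies, intensities)

def standardize_channels(spectra, eps=1e-12):
    """Independently standardize each of the 81 element channels.

    spectra: (n_samples, N_ENERGY, N_ELEM) array; channels of elements
    absent from a sample are all zero. Each channel is centered by the
    mean of its per-sample spectral averages and scaled by the mean of
    its per-sample standard deviations.
    """
    out = spectra.astype(float).copy()
    for e in range(N_ELEM):
        mask = np.any(spectra[:, :, e] != 0, axis=1)  # samples containing element e
        if not mask.any():
            continue
        ch = out[mask, :, e]
        center = ch.mean(axis=1).mean()        # mean of spectral intensity averages
        scale = ch.std(axis=1).mean() + eps    # mean of intensity standard deviations
        out[mask, :, e] = (ch - center) / scale
    return out

def symmetry_onehot(crystal_system, point_group, space_group):
    """Concatenated one-hot vectors of length 7 + 32 + 230 (0-based indices)."""
    vec = np.zeros(7 + 32 + 230)
    vec[crystal_system] = 1.0
    vec[7 + point_group] = 1.0
    vec[7 + 32 + space_group] = 1.0
    return vec

def sample_weights(class_abundance, rarest_element_abundance):
    """Weights inversely proportional to (topological class abundance x
    abundance of least occurring constituent element), mean-normalized to 1."""
    w = 1.0 / (np.asarray(class_abundance, float)
               * np.asarray(rarest_element_abundance, float))
    return w / w.mean()
```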
Network implementation
The neural network-based models presented in this work were implemented in Python using the TensorFlow and Keras libraries. In developing the network architecture, several design choices were made to constrain the number of model parameters and combat overfitting: (1) convolution kernels learned during training were shared across the XANES spectra of all elements, (2) regularization was applied to kernel weights, and (3) dropout was added to the largest fully-connected layer. The number of convolutional layers, regularization strength, dropout rate, kernel size, and hidden layer dimensions were systematically tuned to achieve a network economical in the number of parameters, yet expressive enough to learn clear patterns from considerably diverse data. The models were trained on a Quadro RTX 6000 graphics processing unit (GPU) with 24 GB of random access memory (RAM). Optimization was performed using the Adam optimizer to minimize the binary cross-entropy loss, as is common for binary classification problems.
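The design choices above might be sketched in Keras roughly as follows. The specific filter counts, kernel sizes, pooling, dense width, L2 strength, and dropout rate are our illustrative assumptions, not the authors' tuned values; only the overall structure (kernels shared across element channels via `TimeDistributed`, L2 regularization on kernel weights, dropout on the largest fully-connected layer, and Adam with binary cross-entropy) follows the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

N_ENERGY, N_ELEM = 200, 81
N_SYM = 7 + 32 + 230  # crystal system + point group + space group one-hots

# Spectral branch. Wrapping Conv1D in TimeDistributed applies the same
# kernels to every element channel, i.e. the convolution kernels are
# shared across the XANES spectra of all 81 elements.
spec_in = layers.Input(shape=(N_ENERGY, N_ELEM), name="xanes")
x = layers.Permute((2, 1))(spec_in)              # -> (element, energy)
x = layers.Reshape((N_ELEM, N_ENERGY, 1))(x)     # -> (element, energy, 1)
x = layers.TimeDistributed(layers.Conv1D(
    8, 9, activation="relu",
    kernel_regularizer=regularizers.l2(1e-4)))(x)  # L2 on kernel weights
x = layers.TimeDistributed(layers.MaxPooling1D(4))(x)
x = layers.TimeDistributed(layers.Conv1D(
    16, 9, activation="relu",
    kernel_regularizer=regularizers.l2(1e-4)))(x)
x = layers.TimeDistributed(layers.GlobalAveragePooling1D())(x)
x = layers.Flatten()(x)

# Symmetry branch: concatenated one-hot symmetry-group vectors.
sym_in = layers.Input(shape=(N_SYM,), name="symmetry")

h = layers.Concatenate()([x, sym_in])
h = layers.Dense(128, activation="relu")(h)
h = layers.Dropout(0.5)(h)  # dropout on the largest fully-connected layer
out = layers.Dense(1, activation="sigmoid")(h)

model = tf.keras.Model([spec_in, sym_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

The per-sample misclassification weights described under data preparation would be supplied at training time via `model.fit(..., sample_weight=w)`.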
44. J. J. Rehr, J. J. Kas, F. D. Vila, M. P. Prange, and K. Jorissen, "Parameter-free calculations of X-ray spectra with FEFF9," Physical Chemistry Chemical Physics, vol. 12, no. 21, pp. 5503–5513, 2010.
45. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
46. F. Chollet et al., "Keras." https://keras.io, 2015.

Acknowledgments

N.A. acknowledges National Science Foundation GRFP support under Grant No. 1122374. J.A. acknowledges National Science Foundation GRFP support under Grant No. DGE-1745303. N.A. and M.L. acknowledge support from the U.S. Department of Energy (DOE), Office of Science (SC), Basic Energy Sciences (BES), Award No. DE-SC0020148. C.H.R. was partially supported by the Applied Mathematics Program of the U.S. DOE Office of Science Advanced Scientific Computing Research under Contract No. DE-AC02-05CH11231.
Author contributions
M.L. conceived the study. C.H.R. and M.L. supervised the project. N.A. and J.A. curated the data. J.A. pre-processed the data and performed the principal component analysis. N.A. and J.A. devised the machine learning problem, optimized the network architecture, and analyzed the results. N.A. and J.A. wrote the manuscript, with input from all authors.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information is available for this paper. Correspondence and requests for materials should be addressed to Mingda Li ([email protected]) or Chris H. Rycroft ([email protected]).
Data availability
All the data and code supporting the findings are available from the corresponding authors upon reasonable request.

Supplementary Information

Fig. S1 Element and topological class frequencies in the dataset. (a) The total number of topological samples across training, validation, and testing data containing each element. Each element's entry includes its atomic number, atomic symbol, and number of samples, and is colored by the number of samples. (b) The total number of trivial samples by element. (c) The fraction of topological samples, by element, in the training, validation, and testing sets. The data subdivision reflects a balanced representation of absorbing elements and topological class across the datasets.
Fig. S2 Additional principal component projections for selected elements. (a) Additional decision boundary visualizations of classifications by unsupervised k-means clustering for selected elements. (b) Classification performance of unsupervised k-means clustering according to recall, precision, and F1 metrics.

Fig. S3 Support vector machine (SVM) model performance. (a) Schematic of the combined spectral and symmetry group inputs to a traditional support vector machine (SVM). (b) Overall recall, precision, and F1 scores for topological and trivial classes. Element-specific recall (c-d), precision (e-f), and F1 (g-h) scores for topological and trivial examples, respectively.
Fig. S4 Convolutional neural network (CNN) classifier performance with spectral input. (a) Schematic of the neural network architecture with only a spectral input (no symmetry group information). (b) Overall recall, precision, and F1 scores for topological and trivial classes. Element-specific recall (c-d), precision (e-f), and F1 (g-h) scores for topological and trivial examples, respectively.

Fig. S5 Convolutional neural network (CNN) classifier performance on trivial examples with combined spectral and symmetry group input.
(a) The recall of the trivial class, computed independently for the subsets of test data containing each absorbing element. Each element's entry lists its atomic number, atomic symbol, and recall, and is colored according to the recall. (b) The corresponding precision and (c) F1 score for each absorbing element's subsamples. (d) Loss (left) and accuracy (right) of the training and validation sets as a function of epochs during training of the machine learning model. Note that the accuracy is artificially lower during training due to the presence of dropout and the calculation of accuracy as a running average over data batches. This is recovered during testing, when dropout is turned off and the full data are evaluated on the optimized model.