[PDF] CCDCGAN: Inverse design of crystal structures

Abstract

Autonomous materials discovery with desired properties is one of the ultimate goals for modern materials science. Applying the deep learning techniques, we have developed a generative model which can predict distinct stable crystal structures by optimizing the formation energy in the latent space. It is demonstrated that the optimization of physical properties can be integrated into the generative model as on-top screening or backwards propagator, both with their own advantages. Applying the generative models on the binary Bi-Se system reveals that distinct crystal structures can be obtained covering the whole composition range, and the phases on the convex hull can be reproduced after the generated structures are fully relaxed to the equilibrium. The method can be extended to multicomponent systems for multi-objective optimization, which paves the way to achieve the inverse design of materials with optimal properties.

Full PDF

CCCDCGAN: Inverse design of crystal structures

Teng Long*, Nuno M. Fortunato, Ingo Opahle, Yixuan Zhang, Ilias Samathrakis, Chen Shen, Oliver Gutfleisch, Hongbin Zhang* Institute of Materials Science, Technische Universität of Darmstadt, Darmstadt 64287, Germany *Corresponding to Hongbin Zhang, [email protected] or Teng Long, [email protected]

Abstract

Introduction

The discovery and exploitation of new materials has enormous benefits for the welfare of society and technological revolutions , which motivates the launch of Materials Genome Initiative in 2011 . Till now, high-throughput (HTP) workflows based on density functional theory (DFT) enable massive calculations on existing and potentially novel compounds, accelerating materials discovery dramatically . For instance, crystal structure predictions can be performed based on brute force substitution of the known prototypes or the evolutionary algorithms as implemented in CALYPSO and USPEX . Nevertheless, the soaring computation cost prevents exhaustive screening over immense phase space, limiting the application of such methods. On the other hand, the emergent machine learning techniques have become the fourth paradigm for materials science , as exemplified by the rapid developments and applications in the last decades . Particularly, the machine learning methods have been utilized to discover novel aterials, e.g., applying neural network to assist molecule design . Two particularly powerful methods therein are the variational autoencoder (VAE) and generative adversarial network (GAN) , resulting in significant progresses on molecule design . Combined with the successful forward modeling of physical properties , generative learning on the existed structures enables inverse design, i.e. , prediction of new structures with desired properties. Unfortunately, unlike molecules which can be represented using simplified molecular input line entry specification (SMILES) , inverse design of three-dimensional (3D) crystal structures has been rare due to the challenges in obtaining a continuous representation in the latent space for deep learning . Recently, Nouira et al. developed a CrystalGAN model to design ternary Ni-Pd-H phases starting from binary Ni-H and Pd-H structures using vector based representation , which is restricted in extending to general chemical composition and elemental composition. Such a constraint can be ameliorated using the two-dimensional (2D) continuous representation generated by the voxel and autoencoder methods . For instance, Noh et al. developed the iMatGen model to generate novel V x O y using VAE . The ZeoGAN scheme developed by B. Kim et al. is based on the same representation applied on zeolites, where Wasserstein GAN (WGAN) is used to generate porous materials . S. Kim et al. also developed the Crystal-WGAN to design new Mn-Mg-O system from such a continuous representation . These models exhibit excellent abilities in generating new structures, but not all predicted phases host the desired functionality. For example, only ZeoGAN integrates the optimization of heat of adsorption, whereas the other schemes exert mostly constraints as a filter on the generated data. It is noted that by nature, inverse design entails the optimization of properties in the latent space, and thus internalizes the desired property as a joint objective of the generator . In this work, we developed a GAN-based inverse design framework for crystal structure prediction with target properties and applied it on the binary Bi-Se system. It is firstly demonstrated that deep convolutional generative adversarial network (DCGAN) can be applied to generate new crystal structures without applying any constraint . Taking formation energy as the target property, optimization is integrated into the DCGAN model in two schemes: DCGAN + constraint to select target structures following the conventional screening method, and constrained crystals deep convolutional generative adversarial network (CCDCGAN) with an extra feedback loop for automatic optimization. The performance of the DCGAN + constraint model and CCDCGAN are comparatively evaluated, and it is demonstrated that CCDCGAN is more efficient in generating new stable structures, while the DCGAN + constraint model has a higher generation rate of meta-stable structures. Both schemes can be straightforwardly generalized to perform multi-objective inverse design for multicomponent systems , hence pave the way to realize autonomous inverse design of new crystal phases with target properties. Methods

Data

Typically, DCGAN requires 10 known crystal structures as positive examples during training. owever, there are mostly tens of known crystalline phases for a given binary system, e.g. , Materials Projects (MP) provides only 17 already known Bi x Se y materials, which are not sufficient for deep learning. Following Ref. , we constructed 10981 artificial structures eliminating the large unit cell, i.e. the maximum number of atoms in unit cell less than 20 and the maximum length of unit cell smaller than 10 Angstrom, and we do the substitution based on their result. of Bi x Se y compounds by substituting all binary materials in MP, and follow-up HTP DFT calculations. We use the high-throughput environment to optimize the structure and determine the thermodynamical stability of Bi-Se database and the generated structures, where thermodynamical stability is evaluated by calculation of the formation energy. All data in Bi-Se database are considered to calculate the convex hull. For VASP setting, Perdew−Burke−Ernzerhof (PBE) is adopted, the energy cutoff is 500 eV, and K space density is 40 on each direction for the Brillouin zone.

Representation

Although crystallographic and chemical information are stored in the crystallographic information file (CIF), the discontinuous and heterogeneous formats are not a suitable choice of representation in a generative model, thus a continuous and homogeneous representation including both the chemical and structural information is required. Following Ref. , the lattice constants and atomic positions are translated into the voxel space, followed by encoding into a 2D crystal graph through autoencoder , as demonstrated in Fig. 1(A). In this way, a continuous latent space is constructed. The whole process is reversible, i.e., a random 2D crystal graph can be reconstructed into a crystal structure in real space, which is essential for a generative model. Specifically, 3D voxel grids are used for a typical binary compound: two grids to record the atomic positions of two elements separately and the third one to store the lattice constants, i.e., the lengths and angles of/between them. Hereafter such grids are referred as site voxel and lattice voxel, respectively. Using the probability density based on Gaussian functions, the lattice voxel is defined as

2, ,2

2, , x y z rx y z p e   , (1) where x, y, z denotes the grid index, p is the probability density of the denoted grid, r is the real space distance between this grid to the center of unit cell, and σ denotes the standard deviation of the Gaussian function. Similarly, sites voxel is defined as

2, , ,2

2, , 1 n x y z rnx y z i p e    , (2) where n is number of atoms and r is the real space distance between this grid point and the i th atom. In this way, the inverse transformation is trivial for the lattice voxel while that for the sites voxel relies on the image filter technique . Autoencoder is a type of artificial neural network which is able to encode inputs into low dimension vectors, as well as decode the vectors back . In this work, the autoencoder is applied o encode 3D voxel data into one-dimensional (1D) vectors in the latent space, which is realized based on 3D convolutional neural network (CNN) autoencoder. The loss function of autoencoder consists of two parts: the first is the loss of information, i.e., the difference between inputs and outputs, and the second is the regulation term to prevent over fitting, which yields loss x y       , (3) where x and y are the input and output of the autoencoder, λ is regulation coefficient, and ω is weight in the autoencoder . The construction of autoencoder for both site and lattice voxel are the same, where both are trained using a 90%/10% training/test set ratio. Both site and lattice voxel autoencoders have high reconstruction ratio. All of them are considered to be identical by the crystal structure comparison subroutine available in pymatgen , with fractional length tolerance being 0.2, sites tolerance 0.3, and angle tolerance 5 degrees. For data transformation between real space and voxel space, the voxel used for atomic positions is 64×64×64, while the voxel for lattice parameters is 32×32×32 and the σ used in Gaussian function is always 0.15, function “gaussian_filter” from scipy package is used. The autoencoders constructed by tensorflow in python. The sites autoencoder consists of 10 3D convolutional layers, i.e. is 0.9 and β is 0.99 . Similar, the lattice autoencoder consists of only 8 3D convolutional layers (4 for encoder and 4 for decoder), other parameters are same to the previous one . Detailed design of these two models can be found in Table S1 and Table S2. DCGAN

Turning now to the realization of generative models, VAE and GAN are two most popular algorithms. VAE is a mutation of autoencoder discussed above, which assumes a specific (such as Gaussian) distribution of data (in our case 2D crystal graphs) in the latent space. To be generative, such a distribution function should be defined properly, i.e., to be consistent with the distribution of 2D crystal graphs of the known crystal structures in the latent space. In addition, the distribution function has to be specified prior to the training process and its form determines the performance of the generative model, which demands domain expertise in statistics and profound understanding of the input data. For instance, a recent work by Ren et al. used sigmoid function in VAE to design inorganic crystals and find novel structures that do not exist in training set . In contrast, the GAN model trains two mutual-competitive neural networks ( i.e. , generator and discriminator) to generate new data statistically the same as the training data without assuming the distribution function . That is, the discriminator tries to distinguish the generated data from the training data, while the generator attempts to fool the discriminator by generating data similar to the training data. Mathematically the objective of GAN is defined by the following equation:     max min x g x p x pDG E D x E D x           , (4) where D is the discriminator, G is the generator, E means expectation value, x is the original data, D(x) is the output of discriminator with x as input, p x is the possibility density function of the original data, while p g is the possibility density function of the generated data. Through the competition between the generator and discriminator, the distribution of generated data becomes hardly distinguishable with the distribution of training data. Compared with VAE, GAN does not require specification of the distribution function in the latent space ( p g ), which makes the generation process more robust, and thus is adopted in this work . The DCGAN model consists of generator and discriminator, they both use Adam optimizer with 0.002 learning rate, β is 0.5 and β is 0.999. 2D convolutional layer, dropout and batch normalization are used in both model, while RELU, leakyRELU, tanh and sigmoid are used as activation functions. 1,000,000 steps are trained for this model for 16 hours on Quadro P2000 GPU. Desgin of generator and discriminator are listed in Table S3 and Table S4 respectively. Latent space of the GAN model has 200 dimensions. Constraint

As mentioned above, the objective of inverse design is to design compounds with desired properties, including thermodynamical, mechanical, and functional properties. In this work, we take the formation energy ( i.e. , thermodynamic stability) as the target property. In order to be able to optimize the formation energy in the latent space, another convolutional neural network (CNN) model is trained taking 2D crystal graphs in the latent space as inputs and formation energies as the output physical property. The training is carried out by a 90%/10% training/test-set ratio over the Bi-Se database. The constraint model consists of 4 convolutional 2D layer and 6 fully connected layer, conducted by keras. It also use Adam optimizer, learning rate 0.002, β is 0.5 and β is 0.999, optimization loss is mean square error, activation function is leakyRELU. To prevent overfitting, dropout and batch normalization are used after each convolutional layer. Detailed design of this model is described in Table S5.

DCGAN + constraint

A straightforward way to implement the optimization of physical properties, e.g. , formation energy in this work, is to apply an add-on constraint on the generated structures using DCGAN, as sketched in Fig. 2(A). The advantage doing so is that there is no need to train another model, which saves training time. Additionally, all the existing machine learning models can be transplanted no matter whether such forward predictions can take 2D crystal graphs in the latent space or vector based chemical and structural information as descriptors. This allows the optimization of a wide spectrum of physical properties . Nevertheless, this method is essentially a selection of the structures generated by DCGAN, thus it cannot search automatically for a specific region in the latent space to reach the local optimal values. More details are described in upplementary Section S5. CCDCGAN

The constraint can also be integrated into DCGAN as a back propagator, as illustrated as CCDCGAN in Fig. 3(A), to realize automated optimization in the latent space so that inverse design can be realized. In this way, the constraint gives rise to an additional optimization objective of the generator, which can be mathematically described as:     min x z p f gG E D z E z z p      , (5) where E f is the formation energy predicted by the constraint model, z is the generated 2D crystal graph, and ω is defined as the weight of formation energy loss. Note that such an additional optimization objective cannot outweigh the primary objective, leading to lower weight for the formation energy loss (0.1 in this work) than the discriminator loss. Unlike the DCGAN + constraint model, CCDCGAN can accomplish automated searching for the local minima in the latent space and thus improve the efficiency of discovering new stable structures. It is noted that CCDCGAN requires a training from scratch, which takes 18 hours on a Quadro P2000 GPU machine, which is 2 hours longer than the DCGAN + constraint model. Please refer to Supplementary Section S5 for more details. The generator, discriminator and constraint have exactly the same design as the previous model, all parameters are same as parameters in DCGAN. The only difference is weight of discriminator remains as 1 while the weight of the formation energy loss is 0.1. The training time is about 18 hours under the same condition as DCGAN. Results

Data

HTP DFT calculations leads to 9810 compounds with successful crystal structure optimization . Correspondingly, the formation energies obtained are shown in Fig. S1. Obviously, the whole composition range is covered. However, only 155 (1.56%) out of all the structures are stable (formation energy lower than 0 eV/atom and distance to the convex hull smaller than 100 meV/atom) and 707 (7.2%) cases are meta-stable (formation energy lower than 0 eV/atom and convex hull distance smaller than 150 meV/atom). Interestingly, there is only one phase, i.e. , Bi Se , on the convex hull. Such a database is referred as the Bi-Se database in the rest of the manuscript. epresentation When applied to the Bi-Se database, 9420 of 9810 crystal structures are successfully reconstructed. Such a high ratio is not caused by overfitting, demonstrated by the learning curve of both autoencoders as shown in Fig. 1(B) and (C). The differences of training set loss and test set loss are both negligible, suggesting that the 2D crystal graphs in the latent space contain adequate information to reconstruct crystal structures. This is confirmed by the diversity of twelve typical 2D crystal graphs in the latent space (cf. Fig. 1(D) for four cases). Please refer to Supplementary Section S2 for more details.

Fig. 1. Generation of 2D crystal graphs. (A) Schematic diagram of generating 2D crystal graphs; (B) Learning curve of lattice autoencoder; (C) Learning curve of sites autoencoder; (D) 4 typical crystal structures and their corresponding 2D crystal graphs.

DCGAN

Such an implementation of DCGAN can generate crystal structures with a high success rate (number of generated crystals over number of generated 2D crystal graphs), e.g. , 2397 crystal structures are transformed from 13000 generated 2D crystal graphs and relaxed by the DFT calculation. Most (~90%) discarded cases fail due to a rigid constraint on the interatomic distance where the minimum interatomic distance should be larger than 2.0 Angstrom, which is set based on the known structures in the Bi-Se database. The generated structures cover a large composition range as shown in Fig. 2(B), where the red bullets denote the original data in Bi-Se database and the blue triangles mark the newly generated structures by DCGAN. One interesting observation is that the generated structures are mostly with negative formation energies, on average lower than that of the original database (Fig. S2(A)). For instance, 2183 structures out of 2397 have negative formation energy, which consists of 91.1% of the generated structures, whereas only 46.8% (4588 out of 9810) of the Bi-Se database have negative formation energies. n this regard, despite the DCGAN model has not been fed with the formation energies, it tends to be biased to data with higher density, e.g. , 4588 structures with formation energies within [-0.4, 0] eV/atom are considered to be more typical than 5222 structures with formation energies within [0, 2.5] eV/atom. Out of 2397 generated structures, it is found that 1110 of them are meta-stable, and 21 are stable.

Fig. 2. DCGAN + constraint model. (A) Schematic diagram of DCGAN + constraint; (B) Scatter plot of formation energy vs. composition, red points represent Bi-Se system, blue triangles are DFT calculated DCGAN generated data, and cyan triangles are the DFT calculated data of DCGAN + constraint model; (C) 2D graph of latent space; (D) Learning curve of the training process.

313 out of the 2397 structures are considered to be distinct structures, i.e. , different from those in the Bi-Se database and different from each other, leading to 47 (6) meta-stable (stable) structures correspondingly. For instance, Bi Se and BiSe shown in Fig. S2(B) and (C) are the automatic generated structures, whose formation energies (distances to the convex hull) are -0.192 (0.136) eV/atom and -0.135 (0.111) eV/atom, respectively. Furthermore, DCGAN explores a much larger phase space when generating new structures. As illustrated in Fig. 2(C), the structures in the Bi-Se database are concentrated in a small region in the latent space, whereas much larger phase space is covered by DCGAN. Thus, DCGAN can generate distinct crystal structures beyond the phase space of the known crystal structures. Constraint

The resulting R score and mean absolute error (MAE) being about 85% and 0.019 eV/atom (Fig. S3(A)), respectively. Such a MAE is even smaller than that (0.021 eV/atom) obtained using the state-of-the-art CGCNN model . That is, the 2D crystal graphs in the latent space can be considered as effective descriptors to perform forward prediction of physical properties. We note that overfitting is a marginal issue for such a high accuracy, as seen in the learning curve of the training process in Fig. 2(D). Details on the CNN model are described in Supplementary Section 4. DCGAN + constraint

Applying the constraint on the formation energy (with a tolerance of 0.3 eV/atom) on 2397 generated structures by DCGAN, 1998 crystal structures are selected and optimized by further DFT calculations. Such a screening procedure does not affect the diversity of the composition as shown in Fig. S2(D). The application of such a constraint has evident effect in the latent space, as demonstrated in Movie. S1 where the constraint screens the unwanted points out leading to a shrunk region in the latent space. With such a constraint applied, the ratio of structures with negative formation energy is higher, increasing from 91.1% for DCGAN to 94.4% (1887/1998). The ratio of meta-stable structures also becomes higher compared to the DCGAN case, i.e. , 1028 meta-stable structures are generated from 1887 structures, reaching the ratio of 54.4% which is better than 48.4% in the DCGAN model. However, the number of generated stable structures reduces to 8 due to the selection process. An unexpected observation is that applying the constraint jeopardizes the generation of distinct structures. After performing DFT calculations, only 151 distinct structures remain, which are significantly reduced compared with the previous 313 cases obtained by DCGAN. The reason is two-fold. On the one hand, the generated crystal structures by DCGAN are not guaranteed to be at the mechanical and dynamical equilibrium, i.e., the lattice constants and atomic positions will change during DFT relaxation. On the other hand, the CNN model applied as constraint is not good enough in predicting formation energy, though the MAE for the formation energy is only about 0.019 eV/atom. Thus, 344 candidates have been screened out by considering a tolerance of 0.3 eV/atom. After comparing with the known crystal structures in the Bi-Se database, 84 distinct structures remain with negative formation energy, 35 distinct meta-stable structures and 2 distinct stable structures generated. These structures are still with high quality, e.g. , two of them Bi Se and Bi Se are demonstrated in Fig. S2(E) and Fig. S2(F), with the formation energies (distances to the convex hull) being -0.065 (0.099) eV/atom and -0.202 (0.126) eV/atom, respectively. Overall, generated structures via DCGAN + constraint have a higher ratio of meta-stable structures than DCGAN but lower ratio of distinct structures. CCDCGAN

In comparison with DCGAN, CCDCGAN has a higher success rate of generating crystal structures, i.e. , 3106 crystal structures are transformed successfully from 13000 generated 2D crystal graphs. Same as the DCGAN model, the generated structures cover a large composition range, especially at the Bi rich part centered at Bi Se (Fig. 3(B)). The average formation energy (-0.0895 eV/atom) for the CCDCGAN generated structures (Fig. S2(G)) is noticeably lower than that from the DCGAN (-0.0788 eV/atom, Fig. S2(A)) and DCGAN + constraint (-0.0868 eV/atom, Fig. S2(D)). This suggests that the back propagation indeed causes optimization in the latent space. Correspondingly, the number of structures which are with negative formation energy, meta-stable, and stable is 2918, 1417, and 21, respectively.

Fig. 3. CCDCGAN model. (A) Schematic diagram of CCDCGAN model; (B) Scatter plot of formation energy vs. composition, green points are the DFT calculated data of the generated data; (C) formation energy of Bi Se generated by CCDCGAN; (D) formation energy of Bi Se generated by CCDCGAN with Bi Se excluded in training set; (E) crystal structure of generated Bi Se ; (F) crystal structure of generated Bi Se ; (G) crystal structure of generated Bi Se . The most intriguing result is that the number of distinct structures is bigger than that from DCGAN or DCGAN + constraint, where 373 distinct structures have been identified after performing DFT relaxations on 3106 structures. Correspondingly, the number of distinct meta-stable and stable structures is 90 and 5, respectively. Three particularly promising structures with formation energy lower than their counterparts in Bi-Se database are Bi Se (Fig. 3(E)), Bi Se (Fig. 3(F)), and Bi Se (Fig. 3(G)), whose formation energies (distance to the convex hull) are -0.287 (0.041) eV/atom, -0.139 (0.069) eV/atom, and -0.063 (0.101) eV/atom, respectively. In this regard, CCDCGAN performs better than both the DCGAN + constraint and DCGAN models in generating distinct structures. As seen from Fig. 2(B), and Fig. 3(B), the generated distinct phases seem to be all above the convex hull and there is no new phase on or below the convex hull defined by the Bi Se phase. Particularly, around the Bi Se composition, there is a forbidden region with no distinct phase generated. To find out the reason and also to explore the full competence of the CCDCGAN model, we generated one million crystal structures and did DFT calculation on 1278 generated structures with the Bi Se composition, resulting in 236 distinct structures. As shown in Fig. 3(C), the whole range of formation energies can be covered, with the smallest distance to the convex hull for the generated structures being only 0.0151 eV/atom, which confirms the ability of generating structures on the convex hull. In this regard, the forbidden region close to the convex hull is originated from limited number of generation and also the fact that the generated structures are not at their mechanical and dynamical equilibria. Furthermore, we trained another CDCGAN model with all the phases with the Bi Se composition excluded in the training set, resulting 233 of the 1612 generated Bi Se structures are the same as those in the Bi-Se database (Fig. 3(D)). More importantly, the case with the smallest convex hull being only 0.0005 eV/atom with the corresponding formation energy being -0.3935 eV/atom) has the same structure as the only stable Bi Se phase. This proves the capability of the CCDCGAN model to generate experimental synthesizable structures in unknown composition range, and thus to accelerate the discovery of new crystal phases. Nevertheless, additional DFT calculations are mandatory in order to relax the crystal structures, so that distinct structures on or below the current convex hull can be identified. Particularly for the Bi-Se system, a complete evaluation of the convex hull by performing explicit DFT calculations on one million generated structures covering the whole composition range is still numerically expensive. We suspect that the recently developed machine learning interatomic potential will be helpful for the optimization of the generated crystal structures with the chemical accuracy . Conclusion

We have developed an inverse design framework CCDCGAN consisting of generator, discriminator and constraint and successfully applied it to design unreported crystal structures with low formation energy for the binary Bi-Se system. It is demonstrated that 2D crystal graphs can be used to construct a latent space with continuous representation of the known crystal structures, which serve as effective descriptors for modeling the physical properties ( e.g. , formation energy) and can be decoded into real space crystal structures enabling the generation of new crystal structures. Such an inverse design model can be easily generalized to the other compositions and the multicomponent cases. Importantly, we elucidate that the optimization of physical properties ( e.g. , formation energy) can be integrated into the generative deep learning model as explicit constraints or back propagators. This allows further development of a multi-objective inverse design framework to optimize other physical properties. One apparent challenge is how to get the generated structures into their mechanical and dynamical equilibria, entailing further exploration such as relaxation using the machine learning interatomic potential.

Reference (1) Sanchez-Lengeling, B.; Aspuru-Guzik, A. Inverse Molecular Design Using Machine Learning: Generative Models for Matter Engineering.

Science , (6400), 360–365. https://doi.org/10.1126/science.aat2663. (2) de Pablo, J. J.; Jackson, N. E.; Webb, M. A.; Chen, L.-Q.; Moore, J. E.; Morgan, D.; Jacobs, R.; Pollock, T.; Schlom, D. G.; Toberer, E. S.; Analytis, J.; Dabo, I.; DeLongchamp, D. M.; Fiete, G. A.; Grason, G. M.; Hautier, G.; Mo, Y.; Rajan, K.; Reed, E. J.; Rodriguez, E.; Stevanovic, V.; Suntivich, J.; Thornton, K.; Zhao, J.-C. New Frontiers for the Materials Genome Initiative. npj Computational Materials , (1), 1–23. https://doi.org/10.1038/s41524-019-0173-4. 3) Cabinet-level National Science and Technology Council. Materials Genome Initiative for Global Competitiveness. white paper June 2011. (4) Saal, J. E.; Kirklin, S.; Aykol, M.; Meredig, B.; Wolverton, C. Materials Design and Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials Database (OQMD). JOM , (11), 1501–1509. https://doi.org/10.1007/s11837-013-0755-4. (5) Wang, Y.; Lv, J.; Zhu, L.; Ma, Y. CALYPSO: A Method for Crystal Structure Prediction. Computer Physics Communications , , 2063–2070. https://doi.org/10.1016/j.cpc.2012.05.008. (6) Glass, C. W.; Oganov, A. R.; Hansen, N. USPEX—Evolutionary Crystal Structure Prediction. Computer Physics Communications , (11), 713–720. https://doi.org/10.1016/j.cpc.2006.07.020. (7) Hey, T.; Tansley, S.; Tolle, K. The Fourth Paradigm: Data-Intensive Scientific Discovery ; 2009. (8) LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning.

Nature , (7553), 436–444. https://doi.org/10.1038/nature14539. (9) Pyzer‐Knapp, E. O.; Li, K.; Aspuru‐Guzik, A. Learning from the Harvard Clean Energy Project: The Use of Neural Networks to Accelerate Materials Discovery. Advanced Functional Materials , (41), 6495–6502. https://doi.org/10.1002/adfm.201501919. (10) Ryan, K.; Lengyel, J.; Shatruk, M. Crystal Structure Prediction via Deep Learning. J. Am. Chem. Soc. , (32), 10158–10168. https://doi.org/10.1021/jacs.8b03913. (11) Kingma, D. P.; Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat] . (12) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27 ; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., Weinberger, K. Q., Eds.; Curran Associates, Inc., 2014; pp 2672–2680. (13) Schwalbe-Koda, D.; Gómez-Bombarelli, R. Generative Models for Automatic Chemical Design. arXiv:1907.01632 [physics, stat] . (14) Schmidt, J.; Marques, M. R. G.; Botti, S.; Marques, M. A. L. Recent Advances and Applications of Machine Learning in Solid-State Materials Science. npj Computational Materials , (1), 1–36. https://doi.org/10.1038/s41524-019-0221-0. (15) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. , (2), 268–276. https://doi.org/10.1021/acscentsci.7b00572. (16) Anderson, E. SMILES, a Line Notation and Computerized Interpreter for Chemical Structures ; U.S. Environmental Protection Agency, Environmental Research Laboratory, 1987. (17) Nouira, A.; Sokolovska, N.; Crivello, J.-C. CrystalGAN: Learning to Discover Crystallographic Structures with Generative Adversarial Networks. arXiv:1810.11203 [cs, stat] . (18) Hoffmann, J.; Maestrati, L.; Sawada, Y.; Tang, J.; Sellier, J. M.; Bengio, Y. Data-Driven Approach to Encoding and Decoding 3-D Crystal Structures. arXiv:1909.00949 [cond-mat, physics:physics, stat] . 19) De, S.; P. Bartók, A.; Csányi, G.; Ceriotti, M. Comparing Molecules and Solids across Structural and Alchemical Space.

Physical Chemistry Chemical Physics , (20), 13754–13769. https://doi.org/10.1039/C6CP00415F. (20) Kaufmann, K.; Zhu, C.; Rosengarten, A. S.; Maryanovsky, D.; Harrington, T. J.; Marin, E.; Vecchio, K. S. Paradigm Shift in Electron-Based Crystallography via Machine Learning. arXiv:1902.03682 [cond-mat] . (21) Noh, J.; Kim, J.; Stein, H. S.; Sanchez-Lengeling, B.; Gregoire, J. M.; Aspuru-Guzik, A.; Jung, Y. Inverse Design of Solid-State Materials via a Continuous Representation. Matter , S2590238519301754. https://doi.org/10.1016/j.matt.2019.08.017. (22) Kim, B.; Lee, S.; Kim, J. Inverse Design of Porous Materials Using Artificial Neural Networks.

Science Advances , (1), eaax9324. https://doi.org/10.1126/sciadv.aax9324. (23) Kim, S.; Noh, J.; Gu, G. H.; Aspuru-Guzik, A.; Jung, Y. Generative Adversarial Networks for Crystal Structure Prediction. arXiv:2004.01396 [cond-mat, physics:physics] . (24) Bojanowski, P.; Joulin, A.; Lopez-Paz, D.; Szlam, A. Optimizing the Latent Space of Generative Networks. arXiv:1707.05776 [cs, stat] . (25) Zhang, H.; Liu, C.-X.; Qi, X.-L.; Dai, X.; Fang, Z.; Zhang, S.-C. Topological Insulators in Bi 2 Se 3 , Bi 2 Te 3 and Sb 2 Te 3 with a Single Dirac Cone on the Surface. Nature Physics , (6), 438–442. https://doi.org/10.1038/nphys1270. (26) Heim, E. Constrained Generative Adversarial Networks for Interactive Image Generation. arXiv:1904.02526 [cs, stat] . (27) Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; Persson, K. A. Commentary: The Materials Project: A Materials Genome Approach to Accelerating Materials Innovation. APL Materials , (1), 011002. https://doi.org/10.1063/1.4812323. (28) Opahle, I.; Madsen, G. K. H.; Drautz, R. High Throughput Density Functional Investigations of the Stability, Electronic Structure and Thermoelectric Properties of Binary Silicides. Phys. Chem. Chem. Phys. , (47), 16197–16202. https://doi.org/10.1039/C2CP41826F. (29) Kresse, G.; Furthmüller, J. Efficient Iterative Schemes for Ab Initio Total-Energy Calculations Using a Plane-Wave Basis Set. Phys. Rev. B , (16), 11169–11186. https://doi.org/10.1103/PhysRevB.54.11169. (30) Kramer, M. A. Nonlinear Principal Component Analysis Using Autoassociative Neural Networks. AIChE Journal , (2), 233–243. https://doi.org/10.1002/aic.690370209. (31) Ong, S. P.; Richards, W. D.; Jain, A.; Hautier, G.; Kocher, M.; Cholia, S.; Gunter, D.; Chevrier, V. L.; Persson, K. A.; Ceder, G. Python Materials Genomics (Pymatgen): A Robust, Open-Source Python Library for Materials Analysis. Computational Materials Science , , 314–319. https://doi.org/10.1016/j.commatsci.2012.10.028. (32) Ren, Z.; Noh, J.; Tian, S.; Oviedo, F.; Xing, G.; Liang, Q.; Aberle, A.; Liu, Y.; Li, Q.; Jayavelu, S.; Hippalgaonkar, K.; Jung, Y.; Buonassisi, T. Inverse Design of Crystals Using Generalized Invertible Crystallographic Representation. arXiv:2005.07609 [cond-mat, physics:physics] . (33) Chollet, F.; others. Keras ; 2015. (34) Géron, A.

Hands-On Machine Learning with Scikit-Learn and TensorFlow . (35) Singh, H. K.; Zhang, Z.; Opahle, I.; Ohmer, D.; Yao, Y.; Zhang, H. High-Throughput Screening of Magnetic Antiperovskites.