Efficient Cysteine Conformer Search with Bayesian Optimization
Lincan Fang, Esko Makkonen, Milica Todorovic, Patrick Rinke, Xi Chen
EEfficient Cysteine Conformer Search withBayesian Optimization
Lincan Fang, Esko Makkonen, Milica Todorovi´c, Patrick Rinke, ∗ and Xi Chen ∗ Department of Applied Physics, Aalto University, 00076 AALTO, Finland
E-mail: patrick.rinke@aalto.fi; xi.6.chen@aalto.fi a r X i v : . [ phy s i c s . c o m p - ph ] J un bstract Finding low-energy molecular conformers is challenging due to the high dimensionalityof the search space and the computational cost of accurate quantum chemical methods fordetermining conformer structures and energies. Here, we combine active-learning Bayesianoptimization (BO) algorithms with quantum chemistry methods to address this challenge.Using cysteine as an example, we show that our procedure is both efficient and accurate. Afteronly one thousand single-point calculations and approximately thirty structure relaxations,which is less than 10% computational cost of the current fastest method, we have foundthe low-energy conformers in good agreement with experimental measurements and referencecalculations.
Graphical TOC EntryIntroduction
A molecular conformer is a distinct confor-mation corresponding to a minimum on themolecule’s potential energy surface (PES). Anymolecule with rotatable bonds has several sta-ble conformer structures, each associated withdifferent chemical and electrical properties. Atambient temperatures, all the properties of thatmolecule are the combination of the propertiesof its conformers accessible at the temperatureof the study. Therefore, identifying the low-energy conformers and determining their en-ergy ranking continues to be a topic of greatinterest in computational chemistry, chemin-formatics, computational drug design, andstructure-based virtual screening. While oneconfiguration of a small molecule can be simu-lated routinely by ab initio methods, the largesize of configurational phase space and the con-siderable number of local minima in typical en-ergy landscapes make conformer searches oneof the persistent challenges in molecular mod-eling.
The first challenge in conformer search is suf-ficient sampling of the configurational space. The conformational space (bond lengths, bondangles, and torsions) for even relatively smallmolecules is enormous. Certain approximationsare necessary to make the problem tractable.Since the bond lengths and angles are relativelyrigid in molecules, and the different conform-ers originate from the flexible rotational groups,most search methods focus on sampling the tor-sion angles in molecules, while keeping bondlength and angles fixed. A variety of methodsand tools have been developed to generate di-verse conformer structures.
These methodscan be broadly classified to be either systematicor stochastic.A systematic method relies on a grid tosample all the possible torsion angles in themolecule. This approach is deterministic, butlimited to small molecules because it scalespoorly with increasing numbers of relevant tor-sion angles, i.e. search dimensions. Stochas-tic methods randomly sample the torsion an-gle phase space (sometimes restricted to prede-fined, most relevant ranges) based on differentalgorithms such as Monte Carlo annealing, minima hopping, basin hopping, distancegeometry and genetic algorithms. Stochas-2ic methods can be applied to larger moleculeswith high-dimensional conformer space, but thepredicted conformers may vary. Extensive sam-pling is required and the results may depend onthe random nature of the process.Knowledge-based methods have also been de-veloped to achieve more consistent results.They use a predefined library for torsion an-gles and ring conformations. The library istypically based on experimental structures indatabases such as the Cambridge StructureDatabase (CSD) or the Protein Data Bank(PDB). To search the conformers, knowledge-based methods usually need to be combinedwith the different systematic or stochastic al-gorithms mentioned before.The second challenge in conformer searchesis the sufficiently accurate mapping of energiesand structures. Two classes of total energy ap-proaches are commonly used: force field-basedmethods and quantum chemistry methods suchas density functional theory (DFT) and cou-pled cluster (CC). Quantum chemistry meth-ods achieve higher accuracy in the estimationof molecular energies than force fields becausethey describe the interactions and polarizationin molecules more accurately. However, theyare computationally costly. More often thannot, quantum chemistry methods are too ex-pensive to provide energies for all configurationsproduced in the search.To balance efficiency and accuracy, hierar-chical methods have been developed. Peopleemploy fast computational methods with loweraccuracy to scan configurational space, thenfunnel promising candidate structures throughmore costly methods with higher accuracy to re-fine the conformer structures and energies (suchas force fields → DFT or HF → MP2 → CCSD(T) ). Methods at different levels pre-dict different PES of molecules. To avoid miss-ing the true low-energy conformers, a large por-tion of configurational space have to be sampledat a lower accuracy method level, and manystructures need to be optimized at a higherlevel.In recent years, artificial intelligence (AI)and machine learning (ML) techniques suchas genetic algorithms, artificial neural net- work, and machine learned force fields were used to accelerate structure-to-energy pre-dictions for molecules. The majority of theseschemes requires a large number of data points,which may be costly to compute with ab-initio methods. To reduce the amount of requireddata, Bayesian optimization was introduced inthe structure search. Bayesian optimizationsearch schemes belong to the active learningfamily of methods, which generate data on thefly for optimal knowledge gain.Figure 1: Ball-and-stick model of the cysteinemolecule. Red is used for oxygen, white forhydrogen, gray for carbon, blue for nitrogen,and yellow for sulfur. d , d , d , d , d label thefive dihedral angles of cysteine that we use todefine our search space.In this article, we present a new working pro-cedure for molecular conformer identificationand ranking. We combined the Bayesian Opti-mization Structure Search (BOSS) approach and quantum chemistry simulations to find theconformers of small molecules and accuratelypredict their relative stability. BOSS is apython-based tool for global phase space explo-ration based on Bayesian optimisation. Be-yond the Bayesian active learning method forthe global minimum conformer search in theRef. 31, our procedure aims to find all the rele-vant conformers in one run. We use cysteine asa model system to demonstrate our methodol-ogy.Cysteine was chosen as the model system forseveral reasons. First, it is an amino acid withcritical biological functions. Second, it is theonly amino acid that has a -SH group. The3trong S-Ag and S-Au bonds make it uniquelyinteresting for use in hybrid nanomaterials.
Third, cysteine has five rotational groups, asshown in Figure 1. Therefore it is an inter-esting and accessible 5-dimesional (5-D) sys-tem for Bayesian optimization. Last, the struc-tures and the energy order of cysteine’s con-formers have been calculated by several groupsusing the grid sample method and char-acterized by Fourier transform microwave spec-troscopy experiments so that we can comparethe accuracy and efficiency of our new proce-dure with other computational and experimen-tal results.In brief, using cysteine as an example, wepresent an efficient and reliable procedure topredict the structures and energies of molecularconformers. BOSS ensures sufficient samplingof the configurational phase space and outputsthe structures associated with local energy min-ima. We post-process the machine-learned con-former candidates with geometry optimizationand then add free energy corrections to obtainthe final ranking. We tested the effect of dif-ferent exchange-correlation functionals and vande Waals interactions on the ranking order. Fi-nally, we applied coupled cluster corrections tothe lowest-energy conformers. The methodsand results will be presented in the followingsections. Methods
A. BOSS-based Molecular Con-former Search
Our BOSS-based procedure for molecular con-former search contains four steps: i SystemPreparation, ii Bayesian Optimization Con-former Search, iii
Refinement and iv Valida-tion, as illustrated in Figure 2a.In step i , we first obtain an xyz-file of ourmolecule of interest from a database and thenperform a single geometry optimization with aquantum chemistry method. Then we calcu-late the z-matrix to find the dihedral angles.We chose the dihedral angles d n to describe thedifferent conformers, as they are typically the most informative degrees of freedom for con-formers. We keep all bond lengths and anglesfixed at their optimized values. Such dimen-sionality reduction is standard practice to ex-pedite the molecular conformer search, as men-tioned in the introduction.In step ii , BOSS actively learns the PES ofthe molecule by Bayesian optimization iterativedata sampling. Each data “point” consists ofthe set of dihedral angles d n for a molecular con-figuration and its corresponding total energy E .In this step, we use DFT as the calculator. E therefore corresponds to the DFT total energyof a molecular configuration.BOSS employs Gaussian Process (GP) mod-els to fit a surrogate PES to the data points,then refines it by acquiring more data points atlocations which minimize the exploratory LowerConfidence Bound (eLCB) acquisition func-tion. The most-likely PES model for givendata is the GP posterior mean. The lack ofconfidence in the model is reflected by the GPposterior variance, which vanishes at the datapoints, and rises in unexplored areas of phasespace. The key concepts of this active learningapproach are illustrated in Figure 2b, in whichBOSS iteratively infers a one-dimensional PESof the d dihedral angle of cysteine. Theglobal minimum location and the entire PESare learned in 10 data acquisitions. In anal-ogy with the 1D example, BOSS actively learnsthe PES in N dimensions until convergence isachieved. The advantage of BOSS is not onlyits efficiency, but also the fact that it exploresboth the global minimum and local minima ofthe PES during the search. We exploit this fea-ture to find conformers beyond the global mini-mum, which we associate with the local minimaof the PES. A more detailed introduction of theBayesian optimization method can be found inRefs. 32 and 40.After the BOSS-predicted PES has con-verged, in step iii , we analyze the PES toextract the local minima locations and re-lated structures. Since the PES and its gra-dients can be computed efficiently at anylocation in the N-dimensional phase spacefrom the GP model, BOSS post-processingroutines perform multiple L-BFGS (limited-4igure 2: (a) Overview of our BOSS-based procedure for molecular conformer search, featuring i System Preparation, ii Bayesian Optimization Conformer Search, iii
Refinement and iv Validation.(b) BOSS iterative inference of a one-dimensional (1D) PES of the d dihedral angle of cysteine. TheGP’s native uncertainty (gray areas) facilitates exploratory data sampling. The global minimumlocation and the entire PES are learned in 10 data acquisitions.memory Broyden-Fletcher-Goldfarb-Shanno al-gorithm) minimizations, using the locations ofthe data acquisitions as starting points. Theresulting list of minima is processed to removeduplicates. Using cysteine as an example, iflocal minima structures have the same types of-CH SH chain and hydrogen bonds, we considerthem as duplicates and the higher energy struc-tures will be purged.The conformer set is next refined by geome-try optimization and entropy corrections. First,all degrees of freedom (including bond lengthsand angles) are relaxed to obtain optimizedstructures and energies. Next, we add vi-brational entropy corrections following previ-ous studies.
We compute and add the zero-point energy and the vibrational free energy at300K to the energies of optimized conformers.In step iii we also go beyond DFT. We per-form a coupled cluster calculation for the DFT-optimized conformer structures in a relevant en-ergy window. The difference between the cou-pled cluster and DFT total energy, here calledCC correction, is then added to the entropy cor- rections we added earlier in step iii .In step iv , we validate our results by com-paring the low-energy conformers we found toexperimental and other computational results.System preparation and final validation requirehuman input, but procedures featuring struc-ture search and refinement can be made fullyautomated into a computational workflow. B. Computational Methods
In this work, we employed DFT as the predom-inant energy calculator and employed the all-electron codes FHI-aims for all DFT cal-culations. ”Tight” numerical settings and ”tier2” basis sets were used throughout. To inves-tigate the influence of the exchange-correlationfunctional and the level of dispersion correctionon the final results, we performed our conformersearch with the PBE+TS,
PBE+MBD,
PBE0+TS and the PBE0+MBD func-tionals. For geometry optimizations, the geom-etry was considered to be converged when themaximum residual force was below 0.001 eV / ˚A.5ibrational free energies were computed us-ing the finite-difference method within theharmonic approximation. We used a finite-difference displacement length of δ = 0 . F vib was then cal-culated as follows F vib ( T ) = (cid:90) dωg ( ω ) ¯ hω (cid:90) dωg ( ω ) k B T ln [1 − exp ( − ¯ hωk B T )](1)where g ( ω ) is the phonon density of states andT, ω and k B are temperature, frequency andBoltzmann constant.Going beyond DFT, we performed CC cal-culations with single, double and perturbativetriple excitations (CCSD(T)). These were doneas single-point calculations using the structuresfrom the PBE0+MBD calculation with aug-cc-pVTZ basis sets. For validation purposes, wealso performed MP4 single-point calculationsfor selected conformers in their PBE0+MBDgeometries with 6-311++G(d,p) and aug-cc-pVTZ basis sets. We used the Gaussian16code for the CCSD(T) and MP4 simulations.To support open data-driven chemistry andmaterials science, we uploaded all calcula-tions of this work to the Novel Materials Dis-covery (NOMAD) laboratory. C. 2-D test
To test the accuracy and efficiency of step ii in our procedure, we started with a 2-D searchcase in cysteine (Figure 3). First we rotated the d and d dihedral angles to generate a refer-ence map on a fine grid (30 ×
30 points, Figure3a). Then d and d were sampled by BOSS. Inboth approaches, the bond lengths, bond an-gles and other dihedral angles ( d =180.03, d = 145.59 , d =180.03) were fixed in their DFT-optimized values. We obtained the energy ofeach structure with single-point PBE0+MBDcalculations and then plot the energy relativeto the global minimum.The 2-D PES maps after 60 and 120 data ac-quisitions are shown in Figure 3b and 3c. Look- Figure 3: The 2-D ( d , d ) PES map of cys-teine generated by (a) 30 × / instead of E is plotted.ing at Figure 3, we find that BOSS captures cor-rect minima and maxima already after 60 dataacquisitions (6% of the computational cost ofthe grid method), while after 120 data acqui-sitions the BOSS PES resembles the referencemap very well. This 2D PES features 6 en-ergy minima of similar depth, suggesting con-siderable complexity of cysteine conformationalphase space and many competing minima. Weapply abundant sampling in high-dimensionalproblems so that we can recover all relevant6onformer solutions. D. Cysteine Conformer Search in5-D
After demonstrating the BOSS rationale in 1-D and 2-D, we proceed to 5 dimensions. Thefive dihedral angles ( d - d ) in cysteine weresampled simultaneously by BOSS and the en-ergies of the corresponding configurations wereevaluated with the PBE0+MBD functional.Figure 4 illustrates the refinement of the pre-dicted global minimum with iterative configu-rational sampling. The lowest observed energy(calculated from the BOSS-predicted globalminimum conformer) is shown in Figure 4a andthe values of the corresponding dihedral anglesd n in Figure 4b. The lowest energy observeddecreases continuously. Throughout the proce-dure, the geometry of the global minimum con-former changes, as Figure 4b illustrates. Theglobal minimum undergoes several refinementsuntil, at iteration 830, both the energy and thedihedral angles are converged and only havenegligible changes (∆ E < d < ◦ ).Improvements of the global minimum predic-tion is due to instances of visiting low energyconfigurations chosen smartly form a vast 5-Dspace. However, most model refinements pro-ceeded with higher energy conformers and ex-plores local minima of the PES, on average inthe region 0.4 eV above the predicted globalminimum, as shown by the red dashed line inFigure 4a.Next we address the convergence of the lowenergy part of the PES. This is not a trivialtask, as we cannot monitor the PES in everypoint of 5-D space. It also turns out to be im-practical to track the dihedral angles of severallow energy conformers and monitor convergenceas we did for the global minimum. The reasonis that many conformers are very close in en-ergy and switch order as the iterations progress.We therefore decided to take the energy-versus-conformer-number curve as convergence indica-tor.Figure 5 shows the relative energy to theglobal minimum for all minima found up to 0.4 Figure 4: (a) Convergence of the global min-imum energy computed from the BOSS pre-dicted global minimum configuration (blackline). The average computed energy of the sam-pled conformers is shown with red dashed line.(b) Value of the dihedral angles d n of the BOSSpredicted global minimum as a function of thenumber of sampled points.Figure 5: Progression of the relative energyof predicted local-minima for a PBE0+MBDBOSS run with a total number of 1200 itera-tions. Shown are intermediate curves at 400,600, 800 and 1000 iterations.eV after 400, 600, 800, 1000 and 1200 BOSSiterations. In the figure, 0 eV is set to be thelowest energy found in the 1000-th iteration.The curves after only 400 and 600 iterations7till rise steeply and feature the wrong globalminimum (i.e. do not start at 0 eV). With in-creasing number of iterations, the curves flattenand gradually approach the curve for 1200 it-erations. At 1000 iterations the curve is verysimilar to that of 1200 iterations, which sug-gests that not only the global but also the localminima are converged. The 2-D (d1, d2) pro-jected PES maps are presented in Figure S1.To reliably identify even higher-energy PESminima, BOSS post-processing routines con-duct an exhaustive minimiser-based search.Consequently, the resulting local minima listmay contain similar structures or duplicates.In this study, if two local minima struc-tures have the same types of -CH SH chain(∆ d < ◦ ) and hydrogen bonds (∆ d < ◦ ,∆ d < ◦ , ∆ d < ◦ ), we consider them as du-plicates and only keep the lowest energy one forfurther processing.Next we optimize the structures and includethe vibration energy as described in Section 2A.Finally, we apply CCSD(T) single-point correc-tions to the 15 lowest energy conformers ob-tained from the PBE0+MBD calculations. Results and Discussion
Using the methodology introduced in the pre-vious sections, we performed four indepen-dent conformer searches with the PBE+TS,PBE+MBD, PBE0+TS and the PBE0+MBDfunctionals for cysteine. In this section, wesystemically assess how the different exchange-correlation functionals and van de Waals cor-rections affect the results and discuss how thedifferent steps improve accuracy. We also com-pare our predictions with the experimental re-sults and reference calculations.
We choose two references to make the com-parison and validate our results. Ref 1 contains both experimental and computationalresults. The computational energy orderingis obtained from single-point MP4 calcula-tion on MP2 optimized structures using 6-311++G(d,p) basis sets. In the reference, sixexperimental conformers were found by rota-tional spectroscopy (labeled in red in Figure 6); five other low-energy conformers were pre-dicted from the MP4 simulations, but were notdetected in the experiment (labeled in blackin Figure 6). The authors of Ref 2 did asystematic scan of 11644 initial structures atthe HF/3-21G level, located 71 unique con-formers of cysteine using the MP2(FC)/cc-pVTZ method and finally determined the rel-ative energy of the eleven lowest-energy con-formers with CCSD(T). Ref 2 also provides xyz-coordinates for the observed conformers. Conformer hierarchy
The predicted 15 lowest energy conformerstructures of cysteine with the PBE0+MBDfunctional are shown in Figure 6. The atomiccoordinates of the conformers can been found inthe Supplementary Material. To directly com-pare our results with those reported in Ref 1, we assign our structures the same labels asRef 1. All the eleven conformers in Ref 1 havebeen identified in our simulations within an en-ergy window of 0.2 eV from the global mini-mum. In addition, BOSS predicted new con-formers, which we named N1, N2 . . . . Some ofthem are shown in Figure 6. The new conform-ers BOSS predicted generally have a higher en-ergy.The relative stability of the PBE0+MBD con-formers is shown in Figure 7a. Correspondingplots for the PBE+TS, PBE+MBD and thePBE0+TS functionals are presented in FiguresS2-S4 of the Supporting Information. To illus-trate the importance of different contributionsto the energy hierarchy, Figure 7a and FigureS2-S4 show not only the final energy order butalso intermediate steps.The hierarchy figures show that once the con-formers are extracted, geometry optimizationplays a major role in refining their energy rank-ing. The largest energy changes and reorderinghappens in this step. In the PBE0+MBD sim-ulation, the average energy change of the moststable 15 conformers during the geometry opti-mization is 0.095 eV, while the dihedral anglesof the corresponding structures change on aver-age by ∆ d = 16 . ◦ , ∆ d = 20 . ◦ , ∆ d = 8 . ◦ ,∆ d = 26 . ◦ and ∆ d = 11 . ◦ .8igure 6: Predicted low energy conformers of cysteine from the PBE0+MBD search. Conformersare named as I (NH–O=C), II (OH–N), and III (NH–OH) depending on the type of the hydrogenbonds, and as a, b, or c depending on the configuration of the − CH SH side chain, following Ref 1. The experimentally detected conformers are marked in red and other conformers marked in black.The colour scheme of the atoms is the same as in Fig. 1.The entropy corrections have a smaller ef-fect on the conformer ordering. The zero-point energy contributions (+VE (0K) column)does not trigger any conformer reordering. Itdoes, however, compress the energy spectrumas corrections for higher-energy conformers arelarger than for the global minimum. The finitetemperature corrections (+VE (300K) column)leads to a further compression of the energyspectrum. Now a couple of conformers above0.1 eV switch orders as their vibrational entropycontributions differ.The final column shows our most accurateconformer energy hierarchy, which now includesalso the CCSD(T) corrections. We observe thatthe CCSD(T) corrections are sensitive to theconformer geometry. They generally shift con-formers down in energy towards the global mini-mum conformer. This reduces the energy spac-ing between the conformers. Conformers IIaand IIc are an exception. They remain at theroughly the same relative energy to the global minimum, which is also of conformer type II.They subsequently trade places with other con-formers in the hierarchy.The energy hierarchy of the low energy con-formers from different methods are also listedin Table 1.
Validation
To validate our optimized conformer structures,we start with Ref 2. The geometries reportedin Ref 2 were obtained at the MP2(FC)/aug-cc-pV(T + d)Z level and we compare them againstour PBE0+MBD geometries. To standardizethe comparison, we use the same conformernaming convention as in Ref 1. Among the top ten most stable structures,Ref 2 reports eight structures that we andRef 1 also found (see Tab. 2). These are IIb,IIa, Ib, I’b, Ia, III β b, III β c and III α b. The aver- Using our classification standard, we classified theVI(n/a) conformer in Ref 2 to be IIb. age differences of the dihedral angles betweenour and Ref 2’s geometries are ∆ d = 4 . ◦ ,∆ d = 1 . ◦ , ∆ d = 2 . ◦ , ∆ d = 0 . ◦ and∆ d = 3 . ◦ . These small differences indi-cate that we indeed found the right conformersand that the PBE0+MBD and MP2 geometriesagree closely.Ref 1, unfortunately, does not provideatomic coordinates for the reported conformers.To validate our optimized conformer structuresagainst those of Ref 1, we therefore performedMP4 single-point energy calculations with thesame basis set 6-311++G(d,p) as in Ref 1, but for our PBE0+MBD geometries. The re-sults are reported in Figure 7b and Table 1.Figure 7b and Table 2 show that the ener-gies of the two MP4 calculations (MP4(b1) and MP4(b1) ) agree within 4 meV for each con-former. This close match indicates that ourconformer geometries agree very well with thoseof Ref 1, validating our BOSS-based conformersearch procedure.Table 1 shows the final energy ranking ofthe top ten most stable conformers in Ref 1, Ref 2 and our computational predictions. Amore complete list of the low-energy conformersand their relative energy can be found in TableS1.In our simulations, PBE+TS, PBE+MBD,PBE0+TS and PBE0+MBD all found the cor-rect global minimum structure IIb. PBE+TS,PBE0+TS and PBE0+MBD predicted thesix experimental identified conformers amongthe top seven most stable structures, while10BE+MBD locates the six conformers amongthe top eight most stable ones.In Figure 8, we summarize the comparisonacross the four different exchange-correlationfunctionals we tested. Our reference are theCCSD(T) energies at the PBE0+MBD geome-tries. In Figure 8 we list the conformers thathave a different energy ordering in the DFTand CCSD(T). The energy differences betweenthe cysteine conformers are extremely small.Therefore, it is no surprise that the DFT en-ergy rankings differ from the CCSD(T) re-sults. The accuracy of the different DFT func-tional are then evaluated by the energy dif-ferences comparing to CCSD(T), using the 10lowest energy conformers in CCSD(T)). Com-paring to CCSD(T), the average energy differ-ence is 0.044 eV for PBE+TS, 0.046 eV forPBE+MBD, 0.031 eV for PBE0+TS and 0.030eV for PBE0+MBD (Figure 8). PBE0 is onaverage 0.01 eV more accurate than PBE. Thedifference between the different van der Waalstreatments (TS or MDB) is an order of magni-tude smaller (1 or 2 meV on average). The in-fluence of the different vdW treatments is there-fore negligible for a small molecule like cysteine.For cysteine, we can conclude that PBE+TS issufficient for the conformer search.Since BOSS is able to sample the configura-tional space very efficiently, we performed thewhole conformer search at the PBE0+MBDlevel. For larger molecules, it might becomemore economical to perform an initial BOSS-based conformer search at the PBE+TS leveland to post-relax only a certain number of low-energy conformers with PBE0+MBD.Our CCSD(T) calculations produce a verysimilar energy ranking as the MP4 results inRef 1, as shown in Table 1. The only differ-ence with Ref 1 is the placement of I (cid:48) b, III β b.If we use the same aug-cc-pvtz basis set, samegeometries and same vibrational energy correc-tion from our PBE0+MBD simulations in bothCCSD(T) and MP4, we get the same energy or-der. Therefore, the differences are not causedby the choice of CCSD(T) or MP4. Since wehave validated that we have found very similarstructures as Ref 1 (Figure 7b), the differencemay due to the facts that Ref 1 did not in- Figure 8: Summary of DFT results: each panelshows the average energy difference between therespective DFT functional and the CCSD(T)reference energies for the 10 lowest conformers.In addition, each panel lists the conformers thathave a different order than in CCSD(T).clude the entropy correction and used differentbasis set.Ref 2 has found two new structures simi-lar to IIa, not appearing in Ref 1 and ourresults. Except for these two new structures,the only difference between our CCSD(T) andthe CCSD(T) results in Ref 2 is the orderingof I (cid:48) b and III β b. Again, the energy differencesbetween the conformers in this range are ex-tremely small, and ordering differences in ourresults and the reference can be ascribed to theslight difference of the conformer structures andcomputational settings. Ref 2 used a differentvibrational correction method and included theFocal-Point analysis to extrapolate the energiesto the complete basis set.Comparing our CCSD(T) results to experi-ment, we note that the CCSD(T) ordering ofIIb, Ib, Ia as the three lowest energy conformersagrees with the experimental ordering derivedfrom the relative abundance of the detectedconformers. However the order of Ia and Ib isswitched, the same as the computational rank-ing in Ref 1 and Ref 2. For the next threeconformers, the experiment finds IIa, III β b andIII β c, however, with much lower overall abun-dance than the first three conformers. The cou-pled cluster order is different with III β b, I (cid:48) b, IIaand III β c. These differences can be ascribed tothe low experimental abundance, which mightmake an unambiguous classification difficult, orto additional experimental factors that are not11aken into account in our simulations. Computational efficiency
We close with a note on the efficiency of ournew conformer search procedure without explic-itly performing other search methods in thiswork. BOSS predicts a physically meaningful5-D PES with only 1000 single-point DFT cal-culations. We can put this number of single-point calculations into perspective, by consid-ering that FHI-aims requires on average 30 ge-ometry optimization steps to relax the struc-ture of cysteine. The computational cost of1000 single-point DFT calculations is thereforeequivalent to approx. 30 DFT geometry opti-mizations.From the PES, we extract all relevant low-energy conformers with the build-in BOSS min-ima tool at no further computational expense.In this work, we consider approx. 30 local min-ima, each of which is geometry optimized withDFT. This amounts to 30 geometry optimiza-tions, which is equivalent to approx. 900 DFTsingle-point calculations.Our total computational expense per DFTfunctional for a complete conformer search ofcysteine is therefore 1900 DFT single-point cal-culations or equivalently 60 geometry optimiza-tions. This is a very small computational bud-get, compared to systematic or stochastic conformer search methods that need to relaxthousands of structures. Supady et al. providedetailed numbers for a genetic algorithm (GA)based conformer search of dipeptides. Theirsearch encompasses between 4 and 6 degress offreedom and is therefore similar to ours, as isthe size of the molecules. The GA search re-quires between 20,000 and 60,000 single-pointDFT calculations (referred to as force evalu-ations in Ref. 27) depending on the size ofthe search space and the density of conform-ers in the energy hierarchy. Our BOSS-basedprocedure is a factor of 10 more efficient. Asimilar speed-up was recently observed in aGaussian-process-based structure search of ox-idized graphene on the Ir(111). Conclusions
In summary, we propose a new conformersearch procedure that combines the Bayesianoptimization active learning with quantumchemistry methods. BOSS find all the rele-vant conformers in one run, with all the exper-imental ones already list within the lowest en-ergy conformers. Then we refine the low-energyconformers by DFT structure relaxation, vi-brational energy, and coupled cluster correc-tion. We conclude that the DFT structure re-laxation plays a major role in the refinement ofthe energy order. We also find that PBE0 givesslightly better results than PBE, but the differ-ence between the TS and MBD van der Waalsinteractions are tiny for our system.Unlike traditional conformer search methods,our approach is computationally tractable whileretaining the accuracy of the chosen quantumchemical method throughout. The method isstraightforward to apply to conformer searchof other molecules. Besides the local minimastructures and energies, our procedure also pre-dicts the whole energy landscapes. In futurework, the converged energy landscapes can beadditional data-mined for energy barriers be-tween pairs of conformers.
Acknowledgement
This work was sup-ported by the Academy of Finland (Projectnumbers 308647, 314298 and 316601) andthrough their Flagship programme: FinnishCenter for Artificial Intelligence FCAI. Wethank CSC, the Finnish IT Center for Sci-ence and Aalto Science IT for computationalresources. This work is supported by COST(European Cooperation in Science and Technol-ogy) Action 18234. Lincan Fang thanks GuoxuZhang, Marc Dvorak, Jingrui Li and AnnikaStuke for help with FHI-Aims. He also acknowl-edges financial support from Chinese Scholar-ship Council (Grant No. [2017]3109).
Supporting Information Avail-able
The following files are available free of charge.12 able 1: The energy order of the ten most stable conformers of cysteine from ourDFT, MP4 and CCSD(T) computations and Ref. 38 (Our CCSD(T) and MP4 resultsare based on PBE0+MBD structures. b1: 6-311++G(d,p) basis set, b2: aug-cc-pvtzbasis set, *: vibrational energy correction not included.)
Energy orderPBE+TS IIb IIa Ib Ia III β b I (cid:48) b III β c IIc III α a III α bPBE+MBD IIb IIa Ia Ib III β b I (cid:48) b IIc III β c III α a III α bPBE0+TS IIb IIa Ia Ib I (cid:48) b III β b III β c III α a N1 N2PBE0+MBD IIb IIa Ia Ib I (cid:48) b III β b III β c III α a III α b IIcMP4 (b1) ∗ IIb Ib Ia I (cid:48) b IIa III β b III β c III α b III α a III α cMP4 (b1) ∗ IIb Ib I (cid:48) b Ia IIa III β b III β c III α b III α a III α cMP4 (b2) IIb Ib Ia III β b I (cid:48) b IIa III β c III α b III α a III α cCCSD(T) (b2) IIb Ib Ia III β b I (cid:48) b IIa III β c III α b III α a III α cCCSD(T) IIb Ib Ia I (cid:48) b IIa III β b n/a n/a III β c III α bExp IIb Ia Ib IIa III β c III β bAbundance ratio
10 10 8 3 3 2
Table 2: Predicted low-energy conformers of cystine and relative energies with respectto the global minimum in eV (b1: 6-311++G(d,p) basis set, b2: aug-cc-pvtz basis set,*:vibrational energy correction not included.
Conformer IIb Ib Ia III β b I (cid:48) b IIa III β c III α b III α a III α c IIcMP4 (b1) ∗ ∗ References (1) Hawkins, P. C. D. Conformation Gen-eration: The State of the Art.
Jour-nal of Chemical Information and Modeling , , 1747–1756.(2) Wlodek, S.; Skillman, A. G.; Nicholls, A.Ligand Entropy in Gas-Phase, Upon Sol-vation and Protein Complexation. FastEstimation with Quasi-Newton Hessian. Journal of Chemical Theory and Compu-tation , , 2140–2152.(3) Friedrich, N.-O.; de Bruyn Kops, C.;Flachsenberg, F.; Sommer, K.; Rarey, M.;Kirchmair, J. Benchmarking CommercialConformer Ensemble Generators. Jour-nal of Chemical Information and Modeling , , 2719–2728. (4) Friedrich, N.-O.; Meyder, A.;de Bruyn Kops, C.; Sommer, K.; Flach-senberg, F.; Rarey, M.; Kirchmair, J.High-Quality Dataset of Protein-BoundLigand Conformations and Its Appli-cation to Benchmarking ConformerEnsemble Generators. Journal of Chemi-cal Information and Modeling , ,529–539.(5) Schwab, C. H. Conformations and 3Dpharmacophore searching. Drug DiscoveryToday: Technologies , , e245 – e253.(6) Hawkins, P. C. D.; Skillman, A. G.;Nicholls, A. Comparison of Shape-Matching and Docking as VirtualScreening Tools. Journal of MedicinalChemistry , , 74–82.(7) Vainio, M. J.; Johnson, M. S. Generating13onformer Ensembles Using a Multiobjec-tive Genetic Algorithm. Journal of Chem-ical Information and Modeling , ,2462–2474.(8) Puranen, J. S.; Vainio, M. J.; John-son, M. S. Accurate conformation-dependent molecular electrostatic po-tentials for high-throughput in silicodrug discovery. Journal of ComputationalChemistry , , 1722–1732.(9) Miteva, M. A.; Guyon, F.; Tuff´ery, P.Frog2: Efficient 3D conformation ensem-ble generator for small compounds. Nu-cleic Acids Research , , W622–W627.(10) Hawkins, P. C. D.; Skillman, A. G.; War-ren, G. L.; Ellingson, B. A.; Stahl, M. T.Conformer Generation with OMEGA: Al-gorithm and Validation Using High Qual-ity Structures from the Protein Data-bank and Cambridge Structural Database. Journal of Chemical Information andModeling , , 572–584.(11) Landrum G (2011) RDKit: open-sourcecheminformatics.(12) Chemical Computing Group.(13) Chang, G.; Guida, W. C.; Still, W. C. Aninternal-coordinate Monte Carlo methodfor searching conformational space. Jour-nal of the American Chemical Society , , 4379–4386.(14) Wilson, S. R.; Cui, W.; Moskowitz, J. W.;Schmidt, K. E. Applications of simulatedannealing to the conformational analysisof flexible molecules. Journal of Compu-tational Chemistry , , 342–349.(15) Goedecker, S. Minima hopping: An effi-cient search method for the global min-imum of the potential energy surface ofcomplex molecular systems. Journal ofChemical Physics , , 9911–9917. (16) Sutherland-Cash, K. H.; Wales, D. J.;Chakrabarti, D. Free energy basin-hopping. Chemical Physics Letters , , 1–4.(17) Spellmeyer, D. C.; Wong, A. K.;Bower, M. J.; Blaney, J. M. Conforma-tional analysis using distance geometrymethods. Journal of Molecular Graphicsand Modelling , , 18 – 36.(18) Mekenyan, O.; Dimitrov, D.; Nikolova, N.;Karabunarliev, S. Conformational Cover-age by a Genetic Algorithm. Journal ofChemical Information and Computer Sci-ences , , 997–1016.(19) Kothiwale, S.; Mendenhall, J. L.;Meiler, J. BCL::Conf: small moleculeconformational sampling using a knowl-edge based rotamer library. Journal ofCheminformatics , , 47.(20) Cole, J. C.; Korb, O.; McCabe, P.;Read, M. G.; Taylor, R. Knowledge-Based Conformer Generation Using theCambridge Structural Database. Journalof Chemical Information and Modeling , , 615–629.(21) Allen, F. The Cambridge StructuralDatabase: a quarter of a million crystalstructures and rising. Acta Crystallograph-ica Section B , , 380–388.(22) Berman, H. M.; Westbrook, J.; Feng, Z.;Gilliland, G.; Bhat, T. N.; Weissig, H.;Shindyalov, I. N.; Bourne, P. E. The Pro-tein Data Bank. Nucleic Acids Research , , 235–242.(23) Rossi, M.; Scheffler, M.; Blum, V. Im-pact of Vibrational Entropy on the Sta-bility of Unsolvated Peptide Helices withIncreasing Length. The Journal of Physi-cal Chemistry B , , 5574–5584.(24) Chutia, S.; Rossi, M.; Blum, V. WaterAdsorption at Two Unsolvated Peptideswith a Protonated Lysine Residue: FromSelf-Solvation to Solvation. The Journal of hysical Chemistry B , , 14788–14804.(25) Wilke, J. J.; Lind, M. C.; Schaefer, H. F.;Cs´asz´ar, A. G.; Allen, W. D. Conformersof Gaseous Cysteine. Journal of ChemicalTheory and Computation , , 1511–1523.(26) Nair, N.; Goodman, J. M. Genetic Algo-rithms in Conformational Analysis. Jour-nal of Chemical Information and Com-puter Sciences , , 317–320.(27) Supady, A.; Blum, V.; Baldauf, C. First-Principles Molecular Structure Searchwith a Genetic Algorithm. Journal ofChemical Information and Modeling , , 2338–2348.(28) Chen, X.; Jørgensen, M. S.; Li, J.; Ham-mer, B. Atomic Energies from a Convolu-tional Neural Network. Journal of Chem-ical Theory and Computation , ,3933–3942.(29) Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.;Lubbers, N.; Devereux, C.; Barros, K.;Tretiak, S.; Isayev, O.; Roitberg, A. E.Approaching coupled cluster accuracywith a general-purpose neural network po-tential through transfer learning. NatureCommunications , , 2903–2910.(30) Sauceda, H. E.; Poltavsky, I.; Chmiela, S.;M¨uller, K.-R.; Tkatchenko, A. Con-struction of Machine Learned ForceFields with Quantum Chemical Accu-racy: Applications and Chemical Insights. arXiv:1909.08565 (31) Chan, L.; Hutchison, G. R.; Mor-ris, G. M. Bayesian optimization for con-former generation. Journal of Cheminfor-matics , , 32.(32) Todorovi´c, M.; Gutmann, M. U.; Coran-der, J.; Rinke, P. Bayesian inference ofatomistic structure in functional materi-als. npj Computational Materials , . (33) https://gitlab.com/cest-group/boss.(34) Lee, H.-E.; Ahn, H.-Y.; Mun, J.;Lee, Y. Y.; Kim, M.; Cho, N. H.;Chang, K.; Kim, W. S.; Rho, J.;Nam, K. T. Amino-acid- and peptide-directed synthesis of chiral plasmonic goldnanoparticles. Nature , , 360–365.(35) H¨akkinen, H. The goldsulfur interface atthe nanoscale. Nature Chemistry , ,443–455.(36) Gronert, S.; O’Hair, R. A. J. Ab InitioStudies of Amino Acid Conformations. 1.The Conformers of Alanine, Serine, andCysteine. Journal of the American Chem-ical Society , , 2071–2081.(37) Dobrowolski, J. C.; Rode, J. E.; Sadlej, J.Cysteine conformations revisited. Jour-nal of Molecular Structure: THEOCHEM , , 129 – 134.(38) Sanz, E. M.; Blanco, S.; L´opez, J. C.;Alonso, J. L. Rotational Probes ofSix Conformers of Neutral Cysteine. Angewandte Chemie International Edition , , 6216–6220.(39) Rasmussen, C. E.; Williams, C. K. I. Gaussian Processes for Machine Learn-ing ; the MIT Press, 2006.(40) Brochu, E.; Cora, V. M.; de Fre-itas, N. A Tutorial on Bayesian Opti-mization of Expensive Cost Functions,with Application to Active User Modelingand Hierarchical Reinforcement Learning.arXiv:1012.2599v1.(41) Rossi, M.; Chutia, S.; Scheffler, M.;Blum, V. Validation Challenge of Density-Functional Theory for Peptides–Exampleof Ac-Phe-Ala5-LysH+.
The Journal ofPhysical Chemistry A , , 7349–7359.(42) Hoja, J.; Ko, H.-Y.; Neumann, M. A.;Car, R.; DiStasio, R. A.; Tkatchenko, A.Reliable and practical computational de-scription of molecular crystal polymorphs. Science Advances , , eaau3338.1543) Blum, V.; Gehrke, R.; Hanke, F.;Havu, P.; Havu, V.; Ren, X.; Reuter, K.;Scheffler, M. Ab initio molecular simu-lations with numeric atom-centered or-bitals. Computer Physics Communica-tions , , 2175 – 2196.(44) Havu, V.; Blum, V.; Havu, P.; Schef-fler, M. Efficient O(N) integration for all-electron electronic structure calculationusing numeric basis functions. Journal ofComputational Physics , , 8367 –8379.(45) Ren, X.; Rinke, P.; Blum, V.;Wieferink, J.; Tkatchenko, A.; San-filippo, A.; Reuter, K.; Scheffler, M.Resolution-of-identity approach toHartree–Fock, hybrid density functionals,RPA, MP2 and GW with numeric atom-centered orbital basis functions. NewJournal of Physics , , 053020.(46) Perdew, J. P.; Burke, K.; Ernzer-hof, M. Generalized Gradient Approxima-tion Made Simple. Phys. Rev. Lett. , , 3865–3868.(47) Tkatchenko, A.; Scheffler, M. AccurateMolecular Van Der Waals Interactionsfrom Ground-State Electron Density andFree-Atom Reference Data. Phys. Rev.Lett. , , 073005.(48) Tkatchenko, A.; DiStasio, R. A.; Car, R.;Scheffler, M. Accurate and EfficientMethod for Many-Body van der Waals In-teractions. Phys. Rev. Lett. , ,236402.(49) Adamo, C.; Barone, V. Toward reliabledensity functional methods without ad-justable parameters: The PBE0 model. The Journal of Chemical Physics , , 6158–6170.(50) Frisch, M. J.; Trucks, G. W.;Schlegel, H. B.; Scuseria, G. E.;Robb, M. A.; Cheeseman, J. R.; Scal-mani, G.; Barone, V.; Petersson, G. A.;Nakatsuji, H. et al. Gaussian16 Revision C.01. 2016; Gaussian Inc. WallingfordCT.(51) Himanen, L.; Geurts, A.; Foster, A. S.;Rinke, P. Data-Driven Materials Science:Status, Challenges, and Perspectives. Ad-vanced Science , , 1900808.(52) The Novel Materials Discovery (NOMAD)Laboratory. https://nomad-coe.eu/ .(53) Bisbo, M. K.; Hammer, B. Efficient GlobalStructure Optimization with a Machine-Learned Surrogate Model. Physical Re-view Letters ,124