Predicting Kovats Retention Indices Using Graph Neural Networks
Chen Qu, Barry I. Schneider, Anthony J. Kearsley, Walid Keyrouz, Thomas C. Allison
National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, USA
Abstract
The Kováts retention index is a dimensionless quantity that characterizes the rate at which a compound is processed through a gas chromatography column. This quantity is independent of many experimental variables and, as such, is considered a near-universal descriptor of retention time on a chromatography column. The Kováts retention indices of a large number of molecules have been determined experimentally. The "NIST 20: GC Method/Retention Index Library" database has collected and, more importantly, curated retention indices of a subset of these compounds, resulting in a highly valued reference database. The experimental data in the library form an ideal data set for training machine learning models for the prediction of retention indices of unknown compounds. In this article, we describe the training of a graph neural network model to predict the Kováts retention index for compounds in the NIST library and compare this approach with previous work [1]. We predict the Kováts retention index with a mean unsigned error of 28 index units as compared to 44, the putative best result using a convolutional neural network [1]. The NIST library also incorporates an estimation scheme based on a group contribution approach that achieves a mean unsigned error of 114 compared to the experimental data. Our method uses the same input data source as the group contribution approach, making it straightforward and convenient to apply to existing libraries. Our results convincingly demonstrate the predictive power of systematic, data-driven approaches leveraging deep learning methodologies applied to chemical data and, for the data in the NIST 20 library, outperform previous models.

⋆ Official contribution of the National Institute of Standards and Technology; not subject to copyright in the United States.
∗ Corresponding author.
Email addresses: [email protected] (Chen Qu), [email protected] (Barry I. Schneider), [email protected] (Anthony J. Kearsley), [email protected] (Walid Keyrouz), [email protected] (Thomas C. Allison)
Keywords:
Kováts retention index, machine learning, graph neural network, gas chromatography
1. Introduction
Gas chromatography (GC) is an important analytical technique for the separation and identification of chemical compounds. Frequently used in combination with mass spectrometry (GC/MS) as a means of enhancing the accuracy of identification of chemical compounds, a GC experiment consists of putting an unknown substance in a gaseous state into a carrier gas. This gas is passed through a chromatography column. Interactions with compounds used in the column determine the time elapsed before a compound elutes from (i.e., passes through) the column. The time a compound is retained in the column, indexed against the elution times of known compounds, is called the retention index.

Work in the 1950s by Ervin Kováts [2] demonstrated that the retention index could be made independent of many experimental factors such as column length, column diameter, and film thickness. This results in a dimensionless quantity known as the Kováts retention index, which is the quantity we consider in this article. (There are other retention indices not considered herein.)

In the context of identifying unknown compounds by searching a library of measured compounds, matching the retention index can significantly improve the confidence in results generated by library searching versus use of the mass spectrum alone [3]. Knowledge of retention indices thus has considerable value, and the ability to predict the retention index accurately constitutes a notable improvement to molecular libraries.

A wide variety of techniques have been employed to predict retention indices for various classes of compounds, many with considerable success. We briefly summarize the achievements of several studies below. Early studies using neural networks are characterized by small networks and small data sets of closely related compounds. Later studies consider larger data sets and more diverse sets of compounds. The present work was performed with the largest set of data currently available, significantly larger than those used in any other published investigation.

One of the earliest (1993) applications of neural networks to the prediction of the retention index is due to Bruchmann et al. [4], who studied a set of terpenes. They discovered that their neural network was in good agreement with multilinear regression, but that the neural network did not generalize well to other classes of compounds (whereas multilinear regression performed better in this regard). In a 1999 publication, Pompe and Novič used two artificial neural network approaches to predict retention indices for 381 compounds [5]. The input to their model consisted of 16 informational and topological descriptors calculated from molecular structures. The better of their models gave an error of 19.2 retention index units, outperforming a multilinear regression approach by 3.3 units. Jalali-Heravi and Fatemi [6] studied a set of 53 terpenes using a small artificial neural network and found agreement with experiment within 2 %. Li et al. [7] used topological indices and a neural network to estimate retention indices for 18 compounds. Using molecular descriptors such as boiling point and molecular weight as input to an artificial neural network, Škrbić and Onjia [8] successfully predicted Lee retention indices [9]. Their results agreed with experimental values within 1 …
…96 % MAE), whereas the model of Stein et al. yielded a 108.4 MAE [1, 10]. As the methodology employed in that work is similar in some respects to our own (in particular, both employ deep learning techniques), we will compare our model against the model of Matyushin et al. [1] extensively.

The present work makes use of the MatErials Graph Network (MEGNet) model developed by Chen et al. [24]. This model uses a graph convolutional neural network and has been used to produce state-of-the-art performance in predicting properties of molecular and crystalline compounds using detailed structural information. (A survey article on graph neural networks by Wu et al. [25] gives an in-depth discussion of this methodology.) Graph convolutional networks represent a significant advance in machine learning that is particularly well suited to molecular problems.

The movement toward new machine learning architectures has accelerated over the last few years, due to the existence of large data sets and much improved computational hardware and software, opening the opportunity for testing the limits of chemical property prediction. Early efforts used classical neural network architectures; much effort was dedicated to finding input representations that encoded increasing amounts of chemical information, as well as to addressing the problem of varying input vector length as molecule sizes (i.e., numbers of atoms) changed. A breakthrough publication by a team at Google used a deep learning architecture and demonstrated good performance for predicting total energies of molecules computed using density functional theory (DFT) [26]. A number of methodologies were developed in the following years that improved the accuracy and range of properties that could be predicted using machine learning models. These are important advances in this young and fast-moving field of molecular and crystal property prediction using deep learning models.

However, these efforts are limited by the lack of large and diverse data sets. Most of the work to date has used the QM9 data set, which contains 133 885 molecules composed of C, H, O, N, and/or F with nine or fewer heavy atoms [27, 28]. These data sets rely on accurate geometric information obtained via optimization of the molecular structure using DFT with a modest basis set. Thus, the machine learning models are able to predict a relatively homogeneous (both chemically and methodologically) set of data. In particular, the error in the DFT energies and properties is systematic and thus amenable to learning, as opposed to the various sources of noise present in experimental data (including outright errors). We thus temper our expectations and do not anticipate accuracy of 1 or 2 retention index units. Indeed, this conclusion is supported by the work of Matyushin et al. [1].

Having set the stage for the importance and tractability of predicting the Kováts retention index for various molecules, we now describe our data set, model, and results with an emphasis on comparison with previous results and with a look to the future of machine learning in this area.
2. Data and methods
The data for this study were obtained from NIST Standard Reference Database 1A: NIST/EPA/NIH Mass Spectral Library (NIST 20) [29]. The data contained in the mass spectral library are a subset of the data available in the NIST 20: GC Method/Retention Index Library [17]. The mass spectral library set contains a 2D representation of the molecule (in Molfile format), the mass spectrum, a measured (i.e., experimental) value of the Kováts retention index, a predicted value of the Kováts retention index based on the model of Stein et al., an uncertainty on the predicted retention index value, and other data and metadata on the chemical compound. When multiple experimental determinations of the retention index are available, a consensus value is given along with the number of values and their range. Importantly, these data have been carefully curated by the personnel of the NIST Mass Spectrometry Data Center, yielding a large, high-quality data set that is well suited for deep learning approaches.

The full NIST 20 library contains nearly 307 000 compounds. For nearly 112 000 of these molecules, an experimental Kováts retention index is available. The library contains data from 3 different column types. Among these, the largest collection of experimentally determined Kováts retention indices is for the "semi-standard non-polar" column type. Sets of experimental values for standard polar and non-polar columns are an order of magnitude smaller. Thus, we have restricted the present study to data from experiments using the semi-standard non-polar column type, yielding a data set with 104 896 molecules. These data may be either single experimental values or a consensus value produced from more than one measurement.

In order to ensure that the final model contained sufficient examples of each atom type, any molecule containing an atom that appeared in fewer than 50 molecules in the data set was removed. Therefore, only molecules containing (in order of decreasing frequency) C, O, N, Si, F, Cl, S, Br, H, I, and P atoms remained. The atom appearing in the fewest number of molecules, P, was present in 994 molecules in the data set, ensuring good representation during training. (H atoms are not represented in the input file except where required to adequately describe a molecular structure, hence the low frequency of H atoms in the preceding list.) We next examined histograms of molecular mass and retention index values, seeking to ensure a reasonable density of data at the extreme values of these two quantities to avoid problems during model training. Molecules with mass less than 50 amu or more than 850 amu were removed from the set. Similarly, molecules with Kováts retention indices less than 300 or more than 4100 were omitted. (This procedure removed 246 molecules from the data set.)
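For concreteness, the filtering just described can be expressed in a few lines of code. The following is a minimal sketch of the procedure, not the actual script used in this work; it assumes RDKit for Molfile parsing, and the record layout and function names are ours.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem import Descriptors

def elements(mol):
    """Set of element symbols appearing in a molecule."""
    return {atom.GetSymbol() for atom in mol.GetAtoms()}

def filter_records(records, min_element_count=50,
                   mass_range=(50.0, 850.0), ri_range=(300.0, 4100.0)):
    """Apply the atom-frequency, mass, and retention-index cuts.

    `records` is a list of (mol, retention_index) pairs, where each
    mol comes from, e.g., Chem.MolFromMolBlock(molblock).
    """
    # Count, for each element, the number of molecules containing it,
    # and keep only elements present in at least 50 molecules.
    counts = Counter()
    for mol, _ in records:
        counts.update(elements(mol))
    allowed = {el for el, n in counts.items() if n >= min_element_count}

    kept = []
    for mol, ri in records:
        if not elements(mol) <= allowed:
            continue  # molecule contains a rare element
        if not mass_range[0] <= Descriptors.MolWt(mol) <= mass_range[1]:
            continue  # outside the 50 amu to 850 amu window
        if not ri_range[0] <= ri <= ri_range[1]:
            continue  # retention index outside 300 to 4100
        kept.append((mol, ri))
    return kept
```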
[Approximate location of figure 1]

Figure 1: Histogram of the full Kováts retention index data set used for training, validation, and testing of the model. The red line plots a Gaussian distribution with the same mean and standard deviation as the data set (µ = 2142, σ = 642).

The removal of molecules on the basis of mass and Kováts retention index value gives an indication of where the present model should work best. Although machine learning methods can sometimes extrapolate effectively, this is not something the data permit us to test effectively, especially given the small number of compounds removed. In the case of the mass limit, predictions for derivatized compounds, whose mass may easily exceed 850 amu, should be used with appropriate care.

The data set used for training, validation, and testing (after the removals described above) consisted of 104 650 molecules. A histogram of the retention index data is shown in Figure 1. As can be seen in the figure, the data are nearly Gaussian distributed (µ = 2142, σ = 642). These data will be used for training and testing the model after normalization via computing the z-score,

y′ = (y − µ_y)/σ_y,        (1)

where y′ is the value of the normalized experimental Kováts retention index, y is the corresponding unnormalized experimental value, µ_y is the mean of y, and σ_y is the standard deviation of y.

The features extracted from the NIST 20 data set are the atomic identities and the list of chemical bonds (from the Molfile representation of the molecule). Our goal in this work is to produce a machine learning model capable of giving accurate predictions of the Kováts retention index with a minimum of chemical information needed as input. We deliberately restrict the model input to information on a molecule that can be obtained from a familiar 2D sketch of the molecule or obtained from a library of molecular structures (as we have done in this work). It is hoped that by developing a model that relies only on simple structural information, the barrier to using the methodology for making predictions is as low as possible.

As mentioned previously, we make use of the MatErials Graph Network (MEGNet) methodology developed by Chen et al. [24]. The MEGNet model incorporates a number of features that make it a good choice for machine learning of molecular properties. In particular, a graph neural network is employed [30]. The graph used by the machine learning model corresponds directly to the molecular structure. The nodes/vertices in the graph correspond to the locations of the atoms, and the edges correspond to atomic pairs (there is an edge between any pair of atoms even if they are not formally chemically bonded). Thus, the chemical graph model may be directly mapped onto the graph network model. Properties of atoms and bonds may be assigned to the nodes and edges in the graph model. The MEGNet framework also allows global (i.e., not associated with a single atom or chemical bond) properties of a molecule to be incorporated into the model.

The MEGNet model is thoroughly described in the literature and online, where the source code is also available [24, 31]. We will only repeat the salient features of the model so that it may be reproduced by others.
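To make the molecule-to-graph mapping concrete, the sketch below (our own illustration, assuming RDKit; it is not code from the MEGNet package) enumerates the nodes and the all-pairs edge list, attaching the topological graph distance that, as described below, is used to prune distant pairs.

```python
from itertools import combinations

from rdkit import Chem

def molecule_to_graph(mol, max_graph_distance=5):
    """Map a 2D molecular structure onto graph inputs.

    Nodes are atoms; an edge is created for every atom pair whose
    topological (bond-path) distance does not exceed the cutoff,
    whether or not the pair is chemically bonded.
    """
    # Node list: one entry per atom (features are attached separately).
    nodes = [atom.GetAtomicNum() for atom in mol.GetAtoms()]

    # Topological distance matrix: number of bonds on the shortest
    # path between each pair of atoms.
    dmat = Chem.GetDistanceMatrix(mol)

    edges = []
    for i, j in combinations(range(mol.GetNumAtoms()), 2):
        d = int(dmat[i][j])
        if d > max_graph_distance:
            continue  # prune distant pairs to save memory
        bond = mol.GetBondBetweenAtoms(i, j)  # None if not bonded
        edges.append((i, j, d, bond.GetBondType() if bond else None))
    return nodes, edges
```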
MEGNet version 0.3.5 is used with TensorFlow version 1.14.1 and the Keras API. MEGNet encodes input data in 3 distinct vectors (called "attributes"): state attributes, edge attributes, and node attributes. State attributes are global properties of the molecule (e.g., numbers of atoms, molecular mass). Edge attributes are features of atom pairs (e.g., bond order, membership in a ring). Node attributes are features of the atoms themselves (e.g., atomic number, hybridization). Each of these attributes is used to update the other attributes and is itself updated during model fitting.

Our MEGNet model incorporates 3 atom features (encoded as 18 one-hot variables), 3 edge features (encoded as 7 one-hot variables), and 3 global features. The atom features are the atomic number (11 one-hot variables representing C, H, O, N, F, Si, P, S, Cl, Br, and I), the hybridization of the atom (6 one-hot variables: s, sp, sp², sp³, sp³d, and sp³d²), and the formal charge (1 integer variable) on the atom. The hybridization is calculated from the information in the Molfile using RDKit. The edge features are the bond type (i.e., no bond, single, double, triple, or aromatic; 5 one-hot variables), whether the atoms are in the same ring (a single one-hot variable), and a graph distance (1 integer variable). The graph distance is calculated as the smallest number of chemical bonds along a path connecting the two atoms in the pair (recall that an edge here means any pair of two atoms, not just those that are chemically bonded). We do not consider pairs with a graph distance higher than 5, to save computational resources as discussed below. Global features are the number of heavy (non-hydrogen) atoms in the molecule, the molecular mass divided by the number of heavy atoms, and the number of chemical bonds divided by the number of heavy atoms.

Our MEGNet model was used with 3 MEGNet blocks. These blocks are composed of two layers of densely-connected multi-layer perceptrons (MLP) and a graph neural network layer in which each of the attributes is successively updated. The dense layers used 128 and 64 units, respectively. The MEGNet block steps are followed by a 'Set2Set' layer [32] in which the output of the bond and atom attributes are mapped to the appropriate vector quantities. This is followed by a concatenation step and two densely-connected MLPs (64 and 32 units).

We have modified the MEGNet source code to limit the depth of atom pair properties in order to save computer memory, which scales roughly as the square of the number of atoms times the number of features. With up to 151 atoms in a molecule, the CPU memory requirements ran into the 100s of GB during model development. (GPU memory was not a bottleneck when training the model, however, due to batching.) We also implemented a custom molecule class that relied on features generated using RDKit [33].

Training was carried out for up to 2000 epochs, with early stopping employed, with restoration of the best weights if the value of the loss function for the validation set did not improve for 150 epochs. The average number of epochs in training was approximately 1400. A batch size of 32 was used during training. The Adam optimizer was used in fitting the model, with a learning rate of 2 × 10⁻…. The MAE was used as the loss function instead of the usual mean squared error (MSE) commonly used in training neural network models. The rectified linear unit (ReLU) activation function was used in the MLP layers.
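In code, the model configuration described above might be assembled as follows. This is a hedged sketch: the keyword arguments follow the MEGNetModel constructor as we understand it for MEGNet 0.3.x and may differ between releases, and the learning-rate value shown is hypothetical, since only "2 × 10⁻" survives legibly in the source.

```python
from megnet.models import MEGNetModel

# Feature dimensions match the encoding described above: 18 one-hot
# node (atom) variables, 7 edge (atom-pair) variables, 3 global features.
model = MEGNetModel(
    nfeat_node=18,
    nfeat_edge=7,
    nfeat_global=3,
    nblocks=3,      # three MEGNet blocks
    ntarget=1,      # a single regression target: the retention index
    loss="mae",     # MAE loss, as discussed in the text
    lr=2e-4,        # hypothetical value; the exponent is lost in the source
)
```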
Interestingly, the MEGNet paper describes use of the softmax activation function, but we found that better results were produced using ReLU activation. The output layer consisted of a single MLP unit with linear activation. The experimental values of the Kováts retention index as taken from the NIST 20 library were used as target values. These were normalized as described previously.

Prior to training, the data set was split into three sets, training, validation, and testing, in an 8:1:1 ratio. This was accomplished by randomly dividing the full data set into 10 blocks and then using 8 blocks for training, with the remaining blocks used for validation and testing. Due to the large size of the data set, our testing set includes more than 10 000 retention indices. This division of data was convenient for the 10-fold cross-validation procedure we employed to test the reliability of our model, as sketched below.
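A minimal sketch of the normalization (Eq. (1)) and the 8:1:1 block split, under our stated assumptions (NumPy only; rotating which blocks play which role gives the 10-fold cross-validation scheme):

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed is arbitrary

def split_and_normalize(graphs, targets, n_blocks=10):
    """Z-score the targets (Eq. 1) and split 8:1:1 into train/val/test."""
    targets = np.asarray(targets, dtype=float)
    mu, sigma = targets.mean(), targets.std()
    z = (targets - mu) / sigma  # Eq. (1); invert with y = z * sigma + mu

    # Randomly divide the data into 10 blocks: 8 for training, 1 for
    # validation, and 1 for testing.
    blocks = np.array_split(rng.permutation(len(targets)), n_blocks)
    train = np.concatenate(blocks[:8])
    val, test = blocks[8], blocks[9]

    subset = lambda ids: ([graphs[i] for i in ids], z[ids])
    return subset(train), subset(val), subset(test), (mu, sigma)
```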
3. Results and discussion
The model described above was produced via a trial-and-error process, considering a large number of features and testing the sensitivity of the model to including/removing various features. For example, atom features for aromaticity, ring atoms, and ring size did not significantly improve the model. This is likely because these features duplicate information contained in the bond features. Similarly, for bond features, we attempted to use 3D geometry information from conversion of 2D structures and force-field minimization, but abandoned this approach because we did not see significant improvement and did not want to unnecessarily complicate the model given our particular use case.

We experimented with several different loss functions in an attempt to minimize both the final MAE and the standard deviation of the absolute error of the predicted retention indices versus experiment. Initial runs made use of the MSE, as is common in training machine learning models. We found that switching to the MAE as the loss function produced results with a lower MAE than training with the MSE loss function, while leaving the value of the standard deviation of the error approximately constant. We tried weighted and unweighted combinations of MAE and MSE, as well as combinations of the ℓ1, ℓ2, and ℓ∞ norms. None of these produced results that were superior to the MAE in reducing the mean and standard deviation of the error, so the MAE was used as the loss function for all subsequent model training.

Our model training makes use of training, validation, and testing sets. As there is some inconsistent usage of these terms in the literature, we describe them here. The training set is used in the training of the model. For this work, we used 80 % of the data for this purpose. The validation set consisted of 10 % of the data and was used to evaluate model performance during training. Use of a validation set can be a good way to produce more robust machine learning models, reducing the mean and standard deviation of the error in the final results. Use of the validation set made a small improvement to our final results. Finally, the testing set was composed of the remaining 10 % of the data and was used for evaluating the performance of the model using various performance metrics described below. As the training and validation sets are used in model fitting, they do not provide an unbiased estimate of the model error. However, testing on a fraction of the data can lead to models that are biased. To overcome this problem, k-fold cross validation may be employed. Once we arrived at a model with acceptable performance, we used a 10-fold cross-validation scheme to test the reliability of the model. Cross validation has been shown to provide a robust test of machine learning models. To carry out k-fold cross validation, one divides the data set into k sets (or "folds") and then trains the model k times, each time with k − 1 of the folds used for training. … index units.

Set          MAE          RMSE
Training     9.38 ± …     … ± …
Validation   … ± …        … ± …
Testing      … ± …        … ± …

Table 1: Statistics of the 10-fold cross-validation procedure. The mean value and standard deviation of the mean absolute error (MAE) and the root mean square error (RMSE) over 10 runs is given for each of the 3 sets used.

[Approximate location of figure 2]

Figure 2: Plot of the training and validation loss function (MAE, upper panel) versus the training epoch. It can be clearly seen that overfitting sets in around 50–100 epochs but is stable thereafter. The same plot for the mean squared error (MSE), not used as the loss function during training, is given in the lower panel.
This is also evident in the plot of the training history presented in Figure 2. This figure also shows that the MSE (not used as a loss function during training) follows a similar pattern as the MAE, another indication of good fitting. The plots of MAE and MSE show a rapid decrease in the error metric in the first 50 epochs, followed by a slow drop to a minimal value. There does not seem to be any significant improvement in the performance of the model past 400 epochs. This behavior does not negatively affect our model because we recover the best model weights when early stopping is employed, as is the case in the figure. The similar values of the validation and testing set errors are an indication that the validation set is effective. Despite overfitting, the MAE and RMSE of the testing set are acceptable and are indicative of a good model for prediction of the Kováts retention index. The sample standard deviations of both error metrics over the 10-fold cross-validation procedure indicate that the model fitting is robust. The sample standard deviations for the training set are likely larger due to the larger amount of data in that set. However, the smaller sample standard deviation of the error metrics in the testing set is a strong indication of good model performance.

In order to further characterize the performance of the model, the results from a single run in the cross-validation set are now analyzed in detail. These results are presented in Table 2. This run has an MAE nearly identical to the mean produced by the 10-fold cross validation and thus serves as a good representative of the model. There are several interesting details to notice in the table. The standard deviation of the absolute error (i.e., the RMSE) is consistently larger than the standard deviation of the signed error. The mean (absolute) percent error (MPE) is 1.49 % for the testing set, with a standard deviation of 2.84 %. This compares favorably to MPE values of 1.96 % from the model of Matyushin et al. [1] and 5.… % from the group contribution approach [10].
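For reference, the quantities reported in Table 2 follow directly from the definition ε = RI_experiment − RI_predicted; a minimal sketch (ours, in NumPy) of their computation:

```python
import numpy as np

def error_statistics(ri_experiment, ri_predicted):
    """Error metrics of the kind reported in Tables 1 and 2."""
    ri_exp = np.asarray(ri_experiment, dtype=float)
    eps = ri_exp - np.asarray(ri_predicted, dtype=float)  # signed error
    pct = 100.0 * np.abs(eps) / ri_exp                    # percent error
    return {
        "MAE": np.mean(np.abs(eps)),           # mean absolute error
        "RMSE": np.sqrt(np.mean(eps ** 2)),    # root mean square error
        "MPE": np.mean(pct),                   # mean (absolute) percent error
        "s_signed": np.std(eps, ddof=1),       # sample std. dev., signed error
        "s_abs": np.std(np.abs(eps), ddof=1),  # sample std. dev., unsigned error
    }
```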
Quantity     Training    Validation    Testing
n            83 719      10 464        10 464
[The remaining rows of Table 2, the mean and sample standard deviation of the signed, absolute, and percent errors for each set, are not legible in the source.]

Table 2: Statistics describing the graph convolutional network model performance in predicting the Kováts retention index (RI) for the training, validation, and testing sets. The error is calculated as ε = RI_experiment − RI_predicted. The sample standard deviation, s, is calculated for the signed and unsigned errors.

[Approximate location of figure 3]

Figure 3: Histogram of mean absolute errors in predicting the Kováts retention index (experimental − predicted) in the training, validation, and testing data sets.

[Approximate location of figure 4]

Figure 4: Comparison of predicted to actual values of the Kováts retention index. The red line indicates perfect agreement, with the width of the line equal to 2 standard deviations of the prediction error.
… NIST 20 database. The red line shows 2 standard deviations in the MAE for each of the three sets.

In order to further characterize the performance of the model presented in this work, we performed comparisons with the group contribution model of Stein et al. [10] and the convolutional neural network (CNN) model of Matyushin et al. [1]. The model of Matyushin et al. [1] was reimplemented by us using TensorFlow [34] and RDKit [33]. To fully characterize this model, we performed 10-fold cross validation using the same learning rate as was used for the training of our model.

Table 3 presents mean absolute errors and the sample standard deviation of the absolute errors for the three prediction models for the indicated chemical functionalities. The results are ordered by the MAE of the present model, and the statistics for all compounds are included for comparison purposes. In all of the cases presented, the model of Stein et al. [10] exhibits the largest error statistics of the three models. This is consistent with the conclusions drawn by Matyushin et al. [1] and from the present work. The Stein model works particularly well for hydrocarbons and exhibits the largest errors for O heterocycles. Strangely, the estimate of the Stein model is given in NIST 20 for only 35 of the 994 compounds containing P. Using the full set of P compounds improves the statistics, as shown in the second entry labelled "contains P" in the table, for both machine learning models. However, in both cases the sample standard deviation is largely unchanged even though the MAE decreases appreciably.

The results from Table 3 allow us to explore the performance of the CNN model [1] versus the present model. The MAE for the CNN model is 5.50 to 14.85 retention index units higher than the results from our graph neural network (GNN) model. When expressed as a percentage, the MAE values produced by the CNN are 13 % to 51 % higher than the results from the GNN. When the same comparison is done for the sample standard deviation, the CNN fares better, with values only 0.56 to 8.85 retention index units higher than the GNN, or from nearly equal up to 19 % larger than the GNN. (Note that we have used the updated statistics for P-containing molecules presented in the paragraph above for this analysis.) We note that the performance of the CNN model is generally very good, as might be expected from the analysis performed by Matyushin et al. [1]. We suggest that the superior results for our model are likely due to the power of the graph neural network, particularly when applied to chemical problems, and to the larger size of our model. Finally, we note that the input data needed for either model is the same, namely a molecular representation that captures connectivity, and that conversion between the input formats (SMILES for the CNN and Molfile for the GNN) is easily accomplished by a number of readily-available software libraries.

The largest errors produced by the two neural network models come from molecules with hydroxyl groups, i.e., alcohols and carboxylic acids. Performance on heterocycles, ketones, and molecules that contain S also shows larger MAE values as well as larger sample standard deviations. While we are not aware of any systematic bias in either model that would favor one class of compound over another, these results suggest that there may be room for improvement by explicitly considering these factors.
However, this could also be a symptom of data with larger uncertainties.

The performance of the three models is further demonstrated by comparing error statistics for collections of molecules. This analysis is presented in Table 4. The collections of molecules were gathered from various literature sources as follows. The list of "flavors and fragrances" was taken from the work of Rojas et al. [35]. The list of "fragrance-like" molecules was taken from another publication by Rojas et al. [36]. Another list of "flavors" was taken from the work of Yan et al. [19]. A list of "essential oils" was taken from the work of Babushok et al. [37]. Finally, the list of "terpenes" was taken from a paper by Jalali-Heravi et al. [6]. For the first three sets of molecules, SMILES strings were provided and were used for matching compounds in our data set. SMILES strings for our data set were generated using RDKit [33]. For the essential oil compounds, matches were made using CAS registry numbers.
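The SMILES-based matching benefits from canonicalizing both sides; a minimal sketch (assuming RDKit; the function names are ours):

```python
from rdkit import Chem

def canonical_smiles(smiles):
    """Canonical SMILES for robust string matching, or None if unparsable."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def match_compounds(literature_smiles, dataset_smiles_to_ri):
    """Map literature SMILES onto our data set via canonical SMILES."""
    index = {canonical_smiles(s): ri
             for s, ri in dataset_smiles_to_ri.items()}
    return {s: index.get(canonical_smiles(s)) for s in literature_smiles}
```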
Molecule type    N       Stein [10]         Matyushin [1]      Present work
                         µ        s         µ        s         µ        s
ether            55755   112.60   120.21    34.66    52.16     22.99    43.87
amide            24033   119.26   112.20    38.75    56.73     25.96    47.88
contains O       92136   114.36   119.88    38.03    56.90     26.65    49.64
all compounds    …       …        …         …        …         …        …
[Additional rows of Table 3, including the "all compounds" entries, are not legible in the source.]
Table 3: Evaluation of the performance of the model of Stein et al. [10], the model of Matyushin et al. [1], and the model described in this article for selected chemical functionalities for the (unitless) Kováts retention index. The column labelled N indicates the number of data used in computing the mean (µ) and sample standard deviation (s) of the absolute error, |ε| = |RI_experiment − RI_predicted|.

Molecule set              N     Stein [10]        Matyushin [1]     Present work
                                µ       s         µ       s         µ       s
flavors and fragrances    675   37.94   43.59     18.96   24.40     16.30   17.93
fragrance-like            662   36.81   39.10     18.46   21.48     16.13   17.87
flavors                   354   34.32   31.60     19.47   22.70     15.60   17.90
essential oils            473   41.04   40.86     21.94   22.87     19.09   20.37
terpenes                  39    31.82   20.99     15.35   13.91     11.89   10.33
Table 4: Evaluation of the performance of the model of Stein et al. [10], the model of Matyushin et al. [1], and the model described in this article for selected molecule types. The column labelled N indicates the number of data used in computing the mean (µ) and sample standard deviation (s) of the absolute error, |ε| = |RI_experiment − RI_predicted|. See text for discussion.

The data in Table 4 show marked improvement in the error statistics for all three estimation schemes. The improvement is most pronounced for the model of Stein et al. [10]. Nevertheless, the patterns that emerged in the previous analysis remain intact. The CNN model of Matyushin et al. [1] exhibits errors approximately half of those from the Stein model, with the GNN model of the present work giving further improvement over the CNN results.

The goal of the model presented in this article is to predict the Kováts retention index with good accuracy. The best model is characterized by small values of both the MAE and the standard deviation of the error. Our experience in working with this data set leads us to believe that further improvement may not be possible due to the uncertainty present in the experimental data. However, it is certainly possible that a more powerful machine learning model may lead to even better results. In the context of the present work, this might be accomplished by using additional atom, bond, and global features. We note that, consistent with the approach used by some of the best contemporary machine learning models, we could generate accurate molecular geometries using quantum chemistry techniques. However, this approach requires a significant amount of computer time and would violate our original goal of creating a model that is a drop-in replacement for models used in databases.

Graph convolutional networks have proven to be a very powerful approach for producing models with state-of-the-art accuracy, as exemplified by the MEGNet methodology [24]. For example, the MEGNet model was used to fit 13 properties of the QM9 data set [27, 28] with accuracy close to or exceeding SchNet [38] and the neural message passing methodology previously applied to the same data set [39, 40]. The results produced in the present study do not meet the remarkably low mean absolute error targets enjoyed by these other studies. There are several reasons why this is the case. First, the other studies were performed on the QM9 data set [41, 42]. This data set consists of 133 885 species with up to nine heavy atoms (C, O, N, and F), with properties computed using DFT at the B3LYP/6-31G(2df,p) level of theory. As such, it comprises a set created with consistent methodology and presumably has consistent error behavior as well. In contrast, the present work has approximately half of the data values, but explores a much more varied chemical space. For example, our data set has, on average, twice as many heavy atoms (i.e., atoms other than H) per molecule, and contains molecules with up to 73 heavy atoms.
Moreover, studies based on the QM9 data set use coordinates from the optimized geometries of the molecules to compute bond lengths, to which chemical properties are sensitively related. In contrast, we do not use bond lengths, preferring to use only an indication of whether two atoms are bonded, due to the nature of our intended target. Most importantly, our target data set consists of experimental values, which are necessarily "noisy" and likely to include errors. So, while achieving as small an MAE as possible is desirable, we recognize that it is unlikely that we will be able to achieve MAEs of a few retention index units.

We did attempt to construct a model using SchNet, but were not able to achieve a model with acceptable accuracy. As noted above, SchNet is capable of state-of-the-art accuracy on molecular properties. In this case, we believe the requirement that properties be calculated as a sum over atomic contributions was not appropriate for predicting the retention index, a quantity that is determined by dynamical interactions of a molecule with other molecules in the chromatography column rather than by internal contributions. Given that the group increment model makes a similar assumption at the level of chemical groups rather than atoms, we suggest that the group increment model may be similarly deficient.

We also attempted to construct a traditional neural network (multilayer perceptron) model by using histograms of inter-atomic distances and angles, such as the methodology proposed by Collins et al. [43]. As with other methods, we did not find that this method led to accurate results, though we do not preclude the possibility that a successful model could be constructed in this manner.

Finally, we note that there is significant value in predictive models for the Kováts retention index for mass spectral library searching. The improved performance afforded by the current model should lead to corresponding improvements in library searches: smaller lists of possible matches with fewer false positives. The NIST library of retention index values is constantly growing and being refined. We anticipate that future releases of the library will permit even more accurate models that cover an even greater range of chemical functionalities. The value of such libraries, carefully curated by experts, cannot be overstated, particularly when used in machine learning applications.

4. Conclusion

In this article, we have demonstrated the application of a graph neural network to the problem of predicting an experimentally-derived set of Kováts retention indices. Our model requires only the information that can be readily obtained from a 2D representation of the molecule (i.e., a traditional drawing of a molecule). We have trained our model using a diverse set of more than 100 000 molecules, using 90 % of the data set in training and validation and the remaining 10 % of the data set for testing the performance of the model. We find that we are able to predict the Kováts retention index with an MAE of 28 (with a corresponding RMSE of 58) versus the experimental data. The performance of the current model is very similar to the one reported by Matyushin et al. [1]. However, the current model used a data set three times larger than the one used by Matyushin et al., and our experience shows that larger data sets tend to produce a larger MAE. Reparameterizing the model of Matyushin et al. using the latest NIST 20 data set yields slightly larger MAE and RMSE compared to our model. Notably, both our model and that of Matyushin et al.
perform significantly better than the widely used model of Stein et al. [10], based on group increment values, which has an MAE of 114 and a standard deviation of 167 over the same data set. Both prediction methods use the same input, namely a 2D representation of the molecule, as can be easily created using a variety of chemical drawing programs. This approach has the benefit that estimates of the Kováts retention index may be rapidly generated without the need for more detailed structural information that might be computed using quantum chemistry programs at a cost of considerably more time. Conversely, the lack of detailed structural information that is common in other machine learning applications may ultimately limit the performance of the model used in this work.
Acknowledgement
The authors are grateful to Dr. William Wallace for providing the NIST 20 data set and for helpful discussions pertaining to the content and usage of the data.
References

[1] D. D. Matyushin, A. Y. Sholokhova, A. K. Buryak, A deep convolutional neural network for the estimation of gas chromatographic retention indices, J. Chromatogr. A 1607 (2019) 460395. doi:10.1016/j.chroma.2019.460395.
[2] E. Kováts, Gas-chromatographische Charakterisierung organischer Verbindungen. Teil 1: Retentionsindices aliphatischer Halogenide, Alkohole, Aldehyde und Ketone, Helv. Chim. Acta 41 (7) (1958) 1915–1932. doi:10.1002/hlca.19580410703.
[3] W. P. Eckel, T. Kind, Use of boiling point-Lee retention index correlation for rapid review of gas chromatography-mass spectrometry data, Anal. Chim. Acta 494 (2003) 235–243. doi:10.1016/j.aca.2003.08.003.
[4] A. Bruchmann, P. Zinn, C. Haffer, Prediction of gas chromatographic retention index data by neural networks, Anal. Chim. Acta 283 (1993) 869–880. doi:10.1016/0003-2670(93)85300-9.
[5] M. Pompe, M. Novič, Prediction of gas-chromatographic retention indices using topological descriptors, J. Chem. Inf. Comput. Sci. 39 (1999) 59–67. doi:10.1021/ci980036z.
[6] M. Jalali-Heravi, M. H. Fatemi, Artificial neural network modeling of Kovats retention indices for noncyclic and monocyclic terpenes, J. Chromatogr. A 915 (1–2) (2001) 177–183. doi:10.1016/S0021-9673(00)01274-7.
[7] H. Li, Y. X. Zhang, L. Xu, The study of the relationship between the new topological index Am and the gas chromatographic retention indices of hydrocarbons by artificial neural networks, Talanta 67 (2005) 741–748. doi:10.1016/j.talanta.2005.03.031.
[8] B. Škrbić, A. Onjia, Prediction of the Lee retention indices of polycyclic aromatic hydrocarbons by artificial neural network, J. Chromatogr. A 1108 (2006) 279–294.
[9] M. L. Lee, D. L. Vassilaros, C. M. White, Retention indices for programmed-temperature capillary-column gas chromatography of polycyclic aromatic hydrocarbons, Anal. Chem. 51 (6) (1979) 768–773.
[10] S. E. Stein, V. I. Babushok, R. L. Brown, P. J. Linstrom, Estimation of Kováts retention indices using group contributions, J. Chem. Inf. Model. 47 (2007) 975–980. doi:10.1021/ci600548y.
[11] V. V. Mihaleva, H. A. Verhoeven, R. C. H. de Vos, R. D. Hall, R. C. H. J. van Ham, Automated procedure for candidate compound selection in GC-MS metabolomics based on prediction of Kovats retention index, Bioinformatics 25 (6) (2009) 787–794. doi:10.1093/bioinformatics/btp056.
[12] A. R. Katritzky, M. Kuanar, S. Slavov, C. D. Hall, M. Karelson, I. Kahn, D. A. Dobchev, Quantitative correlation of physical and chemical properties with chemical structure: Utility for prediction, Chem. Rev. 110 (10) (2010) 5714–5789. doi:10.1021/cr900238d.
[13] S. Kumari, D. Stevens, T. Kind, C. Denkert, O. Fiehn, Applying in-silico retention index and mass spectra matching for identification of unknown metabolites in accurate mass GC-TOF mass spectrometry, Anal. Chem. 83 (15) (2011) 5895–5902. doi:10.1021/ac2006137.
[14] E. L. Schymanski, M. Meringer, W. Brack, Automated strategies to identify compounds on the basis of GC/EI-MS and calculated properties, Anal. Chem. 83 (3) (2011) 903–912. doi:10.1021/ac102574h.
[15] J. Zhang, A. Fang, B. Wang, S. H. Kim, B. Bogdanov, Z. Zhou, C. McClain, X. Zhang, iMatch: A retention index tool for analysis of gas chromatography-mass spectrometry data, J. Chromatogr. A 1218 (2011) 6522–6530. doi:10.1016/j.chroma.2011.07.039.
[16] V. I. Babushok, N. R. Andriamaharavo, Use of large retention index database for filtering of GC-MS false positive identifications of compounds, Chromatographia 75 (11–12) (2012) 685–692. doi:10.1007/s10337-012-2231-7.
[17] V. I. Babushok, P. J. Linstrom, J. J. Reed, I. G. Zenkevich, R. L. Brown, W. G. Mallard, S. E. Stein, Development of a database of gas chromatographic retention properties of organic compounds, J. Chromatogr. A 1157 (2007) 414–421. doi:10.1016/j.chroma.2007.05.044.
[18] E. L. Schymanski, C. M. J. Gallampois, M. Krauss, M. Meringer, S. Neumann, T. Schulze, S. Wolf, W. Brack, Consensus structure elucidation combining GC/EI-MS, structure generation, and calculated properties, Anal. Chem. 84 (7) (2012) 3287–3295. doi:10.1021/ac203471y.
[19] J. Yan, J.-H. Huang, M. He, H.-B. Lu, R. Yang, B. Kong, Q.-S. Xu, Y.-Z. Liang, Prediction of retention indices for frequently reported compounds of plant essential oils using multiple linear regression, partial least squares, and support vector machine, J. Sep. Sci. 36 (15) (2013) 2464–2471. doi:10.1002/jssc.201300254.
[20] I. Koo, X. Shi, S. Kim, X. Zhang, iMatch2: Compound identification using retention index for analysis of gas chromatography-mass spectrometry data, J. Chromatogr. A 1337 (2014) 202–210. doi:10.1016/j.chroma.2014.02.049.
[21] I. G. M. Anthony, M. R. Brantley, A. R. Floyd, C. A. Gaw, T. Solouki, Improving accuracy and confidence of chemical identification by gas chromatography/vacuum ultraviolet spectroscopy-mass spectrometry: Parallel gas chromatography, vacuum ultraviolet, and mass spectrometry library searches, Anal. Chem. 90 (20) (2018) 12307–12313. doi:10.1021/acs.analchem.8b04028.
[22] A. K. Zhokhov, A. Y. Loskutov, I. V. Rybal'chenko, Methodological approaches to the calculation and prediction of retention indices in capillary gas chromatography, J. Anal. Chem. 73 (3) (2018) 207–220. doi:10.1134/S1061934818030127.
[23] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci. 28 (1) (1988) 31–36. doi:10.1021/ci00057a005.
[24] C. Chen, W. Ye, Y. Zuo, C. Zheng, S. P. Ong, Graph networks as a universal machine learning framework for molecules and crystals, Chem. Mater. 31 (9) (2019) 3564–3572. doi:10.1021/acs.chemmater.9b01294.
[25] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive survey on graph neural networks, IEEE Trans. Neural Netw. Learn. Syst. (2020). doi:10.1109/TNNLS.2020.2978386.
[26] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, O. A. von Lilienfeld, Machine learning prediction errors better than DFT accuracy (2017). arXiv:1702.05532.
[27] L. C. Blum, J.-L. Reymond, 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13, J. Am. Chem. Soc. 131 (2009) 8732.
[28] M. Rupp, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld, Fast and accurate modeling of molecular atomization energies with machine learning, Phys. Rev. Lett. 108 (2012) 058301.
[29] NIST Standard Reference Database 1A: NIST/EPA/NIH Mass Spectral Library (NIST 20) (2020). doi:10.18434/T4H594.
[30] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. F. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, Ç. Gülçehre, H. F. Song, A. J. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. R. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, R. Pascanu, Relational inductive biases, deep learning, and graph networks (2018). arXiv:1806.01261.
[31] MEGNet: MatErials Graph Network (2020). URL https://github.com/materialsvirtuallab/megnet
[32] O. Vinyals, S. Bengio, M. Kudlur, Order matters: Sequence to sequence for sets (2015). arXiv:1511.06391.
[33] RDKit: Open-source cheminformatics (2020). URL https://www.rdkit.org
[34] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems (2015). Software available from tensorflow.org.
[35] C. Rojas, P. R. Duchowicz, P. Tripaldi, R. P. Diez, QSPR analysis for the retention index of flavors and fragrances on a OV-101 column, Chemom. Intell. Lab. Syst. 140 (2015) 126–132. doi:10.1016/j.chemolab.2014.09.020.
[36] C. Rojas, P. R. Duchowicz, P. Tripaldi, R. Diez, Quantitative structure-property relationship analysis for the retention index of fragrance-like compounds on a polar stationary phase, J. Chromatogr. A 1422 (2015) 277–288. doi:10.1016/j.chroma.2015.10.028.
[37] V. I. Babushok, P. J. Linstrom, I. G. Zenkevich, Retention indices for frequently reported compounds of plant essential oils, J. Phys. Chem. Ref. Data 40 (2011) 043101. doi:10.1063/1.3653552.
[38] K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, K.-R. Müller, SchNet – a deep learning architecture for molecules and materials, J. Chem. Phys. 148 (2018) 241722. doi:10.1063/1.5019779.
[39] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, Neural message passing for quantum chemistry, CoRR abs/1704.01212 (2017). arXiv:1704.01212.
[40] P. B. Jørgensen, K. W. Jacobsen, M. N. Schmidt, Neural message passing with edge updates for predicting properties of molecules and materials (2018). arXiv:1806.03146.
[41] L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52 (2012) 2864–2875.
[42] R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Sci. Data 1 (2014) 140022.
[43] C. R. Collins, G. J. Gordon, O. A. von Lilienfeld, D. J. Yaron, Constant size descriptors for accurate machine learning models of molecular properties, J. Chem. Phys. 148 (2018) 241718. doi:10.1063/1.5020441.