MultiDK: A Multiple Descriptor Multiple Kernel Approach for Molecular Discovery and Its Application to The Discovery of Organic Flow Battery Electrolytes
Sung-Jin Kim, Adrián Jinich, and Alán Aspuru-Guzik*
Department of Chemistry and Chemical Biology, Harvard University
E-mail: [email protected]
Abstract
We propose a multiple descriptor multiple kernel (MultiDK) method for efficient molecular discovery using machine learning. We show that the MultiDK method improves both the speed and the accuracy of molecular property prediction. We apply the method to the discovery of electrolyte molecules for aqueous redox flow batteries. Using multiple-type, as opposed to single-type, descriptors, more relevant features for machine learning can be obtained. Following the principle of the 'wisdom of the crowds', the combination of multiple-type descriptors significantly boosts prediction performance. Moreover, MultiDK can exploit irregularities between molecular structure and property relations better than the linear regression method by employing multiple kernels, that is, more than one kernel function for a set of input descriptors. The multiple kernels consist of the Tanimoto similarity function and a linear kernel for a set of binary descriptors and a set of non-binary descriptors, respectively. Using MultiDK, we achieve an average performance of r = 0.92 with a set of molecules for solubility prediction. We also extend MultiDK to predict pH-dependent solubility and apply it to solubility estimation of quinone molecules with ionizable functional groups as strong candidates for flow battery electrolytes.

Introduction
Aqueous organic flow batteries are emerging as a low-cost alternative to store renewable energy. For example, Huskinson et al., Yang et al., and Liu et al. experimentally showed that high-capacity energy storage can be achieved using earth-abundant organic electrolytes such as quinone molecules. Given the vast molecular space covered by all possible quinone molecules, high-throughput computational screening is important to find electrolytes that satisfy the stringent requirements of aqueous flow batteries. In particular, one such flow battery system requires a redox potential greater than 0.9 V for a catholyte and less than 0.2 V for an anolyte, as well as a solubility greater than one molar for both electrolytes. Moreover, quinone electrolytes operating in both acid (pH 0) and alkaline (pH 14) flow battery environments have been demonstrated.

Recent high-throughput computational screening of benzo-, naphtho-, anthra-, and thiopheno-quinone libraries demonstrated that the reduction potential of these redox couples can be predicted accurately utilizing molecular quantum chemistry methods and linear regressions. Using the free energy of solvation as a proxy descriptor, the molecular solubility of electrolytes was also predicted in those studies. Here, we build upon this work by developing a machine learning strategy that results in strong correlations with experimental solubility data and predicts the required molecular properties in order to accelerate molecular screening by several orders of magnitude.

The computational prediction of molecular solubility has been a research topic for decades, with most research being driven by the field of drug discovery. However, predicting the solubility of organic electrolytes is particularly challenging, given the stringent target solubilities and the extreme pH values of flow battery electrolyte solutions.
While the target solubility of drug molecules is generally less than 0.1 molar, the target for flow battery organic electrolytes can be more than 1 molar. Moreover, molecular libraries to screen potential flow battery electrolytes include extremely acidic or basic organic molecules, while the majority of drug candidates are relatively weak acids and bases.

Both machine learning and quantum chemical approaches can be used to estimate molecular solubility. Whereas machine learning approaches predict solubility based on training to experimental data, quantum chemistry aims to predict solubility from first principles. Although quantum chemical approaches are preferable for obtaining a mechanistic understanding of underlying principles, our focus here is on machine learning approaches, which facilitate high-throughput and artificially-intelligent molecular discovery.

Machine learning approaches can be categorized into three types of methods according to the types of descriptors used: property-based methods, structure-based methods, and functional group-based methods (Table 1). Property-based methods predict physicochemical values based on molecular properties which can be measured experimentally or obtained from computational approaches. One such property used for solubility estimation is the partition coefficient, the logarithm of which is denoted as logP. Several methods have been proposed to calculate logP. The general solubility estimation method (GSE), with its extended and modified variants, is an example of a property-type method which estimates logS from logP. On the other hand, structure-based methods rely on the estimation of solubility as a function of molecular structure. Structure is usually represented by a binary fingerprint, consisting of molecular topology, connectivity, or fragment information.
Finally, group-based methods partition molecules into functional groups, and the contribution of each to the value of a physicochemical property is estimated.

Property-based methods generally involve fewer regression parameters than the other two approaches, but require additional computation in order to estimate intermediate properties included in the descriptor set. If large experimental data is available for intermediate properties such as logP, property-based methods can predict solubility for a wider range of molecules than any of the other methods. However, a significant gap between logP-based estimation and experimental solubility still remains. Large efforts have been devoted to reducing this gap by adding more input information to the set of descriptors, with a concomitant increase in the complexity of the regressions employed.

Two examples of property-based methods, the GSE approach and Delaney's extended GSE (EGSE) approach, rely on two and three fitted parameters, respectively. In a previous study, the prediction performances of GSE and EGSE were shown to be r(GSE) = 0.67 and r(EGSE) = 0.69 for a dataset of 1305 compounds compiled by its authors, which highlights the gap between prediction and experiment for such methodologies.

Structure-based methods predict solubility directly from molecular structural information, which can be implemented by various types of descriptors. Generally, binary fingerprints offer a good trade-off between simplicity and predictive power. We recently developed the concept of neural fingerprints, which are structure-based and application-specific, with input descriptors generated for arbitrary size and shape based on a molecular graph.

Zhou et al. predicted molecular solubility using a binary circular fingerprint descriptor. Although they demonstrated a prediction performance of r = 0.83, the authors had to carefully select the training data set in order to achieve that value of r. Huuskonen showed that a prediction performance of r = 0.92 can be achieved by using non-binary descriptors consisting of 53 parameters, including 39 atom-type electro-topological state (E-state) indices. However, non-binary descriptors significantly increase computational cost in both the training and validation stages, especially when feature selection is performed during the regression process. A different binary fingerprint approach has been investigated by Lind and Maltseva, in which support vector regression employing the Tanimoto similarity kernel is applied in order to overcome the limits of the multiple linear regression method.

The group-based methods sum the contributions of all associated functional groups, each multiplied by the number of times that group occurs in a compound: $C_0 + \sum_{i=1}^{N} C_i G_i$, where $G_i$ is the number of times the $i$th group appears in the compound, $C_0$ is a constant bias parameter, and $C_i$ is the contribution of the $i$th group. Hou et al. proposed an atom contribution method, which overcomes the 'missing fragment' problem in pure group contribution methods. The atom contribution method categorizes atoms together with their surrounding molecular environment. Cheng et al. used functional key descriptors such as MACCS Keys and PC881 instead of directly counting the number of each functional group. This approach simplifies descriptor values to binary form, but the 'missing fragment' problem and the need for a large training data set remain unavoidable when the number of keys is small and large, respectively. Moreover, Cheng et al. applied them to a solubility classification task with a much lower solubility requirement, 10 µg/mL, than the threshold values necessary for aqueous flow battery applications.

Table 1: A categorization of solubility estimation methods. First, machine learning and quantum calculation methods are depicted. The machine learning methods include the property-based, structure-based and group-based methods.

Category              Methods
Machine Learning      Property-based method
                      Structure-based method
                      Group-based method
Quantum Calculation
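The group-contribution sum described before Table 1 can be illustrated with a minimal sketch. The function name `group_contribution`, the group labels and every numerical contribution below are hypothetical values invented for illustration; they are not fitted parameters from any of the cited methods.

```python
def group_contribution(group_counts, contributions, bias):
    """Return bias + sum_i C_i * G_i for one compound.

    group_counts maps a group name to G_i (how often the group occurs);
    contributions maps a group name to its contribution C_i.
    """
    total = bias
    for group, count in group_counts.items():
        # A group that is absent from the contribution table raises KeyError:
        # this is the 'missing fragment' situation mentioned in the text.
        total += contributions[group] * count
    return total


# Hypothetical contributions and bias, invented purely for illustration.
contributions = {"OH": 0.5, "CH3": -0.4, "COOH": 0.8}
bias = 0.2

log_s = group_contribution({"OH": 2, "CH3": 1}, contributions, bias)
print(round(log_s, 3))
```

A compound containing a group outside the table cannot be scored at all in this sketch, which is one way to see why atom-based or key-based descriptors were proposed as alternatives.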
The ability to carry out solubility predictions that account for pH-dependence is critical to discovering molecules for aqueous flow batteries. In addition to mandating very high solubility, the pH required to operate an organic flow battery system varies depending on the required redox potential values and other experimental considerations. For instance, the negative electrolytes 9,10-anthraquinone-2,7-disulphonic acid (AQDS) and 2,6-dihydroxyanthraquinone (DHAQ) require 1 molar solubility at pH 0 and pH 14, respectively. While prediction methods for intrinsic solubility have been widely discussed, methods to predict pH-dependent solubility have remained less explored. In theory, the Henderson-Hasselbalch relationship can be used to predict pH-dependent solubility based on the intrinsic solubility of a molecule. However, the limitations of current pKa prediction accuracies, as well as the salt plateau phenomena of ionic solubility, encourage the use of data-driven approaches. This requires significantly more experimental training data (solubility as a function of pH) than intrinsic solubility prediction. Moreover, the intrinsic solubility of extremely strong acids with a negative pKa value has not been well investigated in the literature.

In high-throughput molecular screening, the development of an accurate and cost-effective property estimation method is a key factor for successfully finding new candidate molecules. In this work, we develop a fast and accurate property estimation method for high-throughput molecular discovery. We name the proposed approach the multiple descriptor multiple kernel (MultiDK) method. The method relies on combining an ensemble of different descriptors, including fingerprints, functional keys, as well as other molecular physicochemical properties. We also apply different kernels for different types of descriptors to overcome intrinsic irregularities between a fingerprint and a property.
Both intrinsic and pH-dependent solubility estimations are supported by the MultiDK approach.

Methods
Datasets and Tools
We tested the performance of MultiDK on four datasets containing 1676, 496, 1140 and 3310 molecules, respectively, each taken from a previously published source. The 1676-molecule dataset includes most of the molecules of an earlier 1297-molecule dataset. The tests were performed using 20-fold cross-validation. In this work, we use Python packages including Pandas, Scikit-learn, Tensorflow and Seaborn for data manipulation, machine learning, and visualization.

MultiDK method

In this paper, we compare the prediction performance of the MultiDK method against single descriptor (SD) and multiple descriptor (MD) methods. The SD method uses only one type of descriptor, such as a Morgan fingerprint, MACCS keys or a specific molecular physicochemical property. Morgan fingerprints represent the atom and path structure of a molecule using a binary hashing procedure. MACCS keys represent functional group information. For molecular properties, we include molecular weight, Labute's approximate surface area (LASA), or the logarithm of the partition coefficient (logP). The MD and MultiDK methods include more than one descriptor. Both the Morgan fingerprint and the MACCS keys are binary descriptors, while the physicochemical molecular property is a non-binary, real-valued descriptor.

The MultiDK approach predicts the target molecular property as follows:

$$y = \sum_{i=1}^{L} w^{B}_{i}\, k_B(x^{B}, x^{B}_{i}) + w^{NB} \cdot x^{NB} + w_0 \qquad (1)$$

where $x^{B}$ and $x^{NB}$ are binary and non-binary descriptor vectors, respectively, $x^{B}_{i}$ is the binary descriptor vector of the $i$th training molecule, $w^{B}$ and $w^{NB}$ are the weight vectors corresponding to $x^{B}$ and $x^{NB}$, respectively, $w_0$ is the intercept, $L$ is the number of training molecules, and $k_B(\cdot)$ is a binary kernel function.

Rather than using a single kernel or linear regression, MultiDK utilizes multiple kernels: a nonlinear binary kernel for binary descriptors and, separately, linear processing for non-binary descriptors.
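To make the structure of Eq. (1) concrete, the following minimal sketch evaluates it for one molecule, assuming that the binary kernel is the Tanimoto similarity and that fingerprints are stored as sets of on-bit indices. The names `tanimoto` and `multidk_predict`, the toy fingerprints and all weights are hypothetical and chosen only for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention when both fingerprints are empty
    return len(a & b) / union


def multidk_predict(x_b, x_nb, train_b, w_b, w_nb, w0):
    """Evaluate Eq. (1): kernel term over training molecules + linear term + intercept."""
    kernel_term = sum(w_i * tanimoto(x_b, x_i) for w_i, x_i in zip(w_b, train_b))
    linear_term = sum(w * x for w, x in zip(w_nb, x_nb))
    return kernel_term + linear_term + w0


# Hypothetical toy data: two training fingerprints and made-up weights.
train_b = [{0, 1, 2}, {2, 3}]
w_b = [1.5, -3.0]   # one kernel weight per training molecule
w_nb = [0.25]       # weight of a single non-binary descriptor
w0 = -1.0           # intercept

y = multidk_predict({1, 2}, [2.0], train_b, w_b, w_nb, w0)
```

The kernel term has one weight per training molecule, so the number of regression parameters grows with the size of the training set rather than with the fingerprint length.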
To optimize a kernel function, multiple combinatorial kernels have been used in various applications including biomedical data and YouTube video data. Here, we use a multiple kernel approach to apply appropriate kernels to different features instead of training the kernel. The binary kernel function $k_B(\cdot)$ contributes by exploiting a non-linear relationship between the molecular structure and property. The non-linear relationships arise primarily because each bit indicates the presence or absence of a pattern rather than a quantitative value. MultiDK uses all training molecules as support vector molecules for kernel processing, similar to support vector machines. We use the Tanimoto kernel, which has been used in a wide range of machine learning applications, such as exploiting binary feature information to recognize white images on a black background, as well as a kernel for support vector and Gaussian process regression in molecular property prediction.

In the MultiDK approach, ensemble learning is employed based on multiple combinatorial descriptors according to the principle of the 'wisdom of the crowds'. The set of descriptors in MultiDK includes the Morgan circular fingerprints, MACCS Keys fingerprints and three non-binary molecular properties. The three types of descriptors represent structure hash (atom, path), structure pattern (key, functional group) and target-related molecular properties. We find that this ensemble combination is effective for predicting molecular properties because both atom and subgroup representations are employed in the set of descriptors together with the related molecular properties. Moreover, we use different kernels for binary and non-binary descriptors. In particular, a binary similarity kernel is applied to the binary descriptors and a linear kernel to the non-binary descriptors.

We evaluate the methods with training and cross-validation phases. In the training phase, we optimize the regression parameters using Ridge regularization.
The descriptor consists of 4096 binary bits of the Morgan circular fingerprint with radius 6, 117 binary bits of the MACCS Keys and a few non-binary scalar descriptors. We generate all descriptors using the RDKit tool except for the partition coefficient, which we obtain from Cxcalc in the Chemaxon Marvin suite. Before linear regression, we pre-process the 4213 binary bits with the binary similarity kernel by calculating the Tanimoto similarity between an input vector and the set of training vectors. We pass the non-binary descriptors directly to the linear regression stage without pre-processing. Then, the binary kernel output values and the direct non-binary values are entered into the Ridge linear regression stage. We employ the Ridge regression routine in the scikit-learn Python package. The regularization process eventually produces the best regression coefficients and an intercept corresponding to the maximum R performance. In the cross-validation phase, a combination vector of the binary kernel outputs and the direct descriptors of a test molecule is multiplied by the coefficients obtained in the training phase.

MultiDK for estimating intrinsic solubility, logS
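The training and prediction phases described in the MultiDK method section can be sketched end to end as follows. This is only an illustration under stated assumptions: it swaps the scikit-learn Ridge routine used in the paper for a small hand-written closed-form solver so that it runs with the standard library alone, it leaves the intercept unpenalized, and the helper names (`solve`, `ridge_fit`, `multidk_features`) and all toy data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ridge_fit(features, targets, alpha):
    """Ridge regression with an unpenalized intercept (data are centered first)."""
    n, d = len(features), len(features[0])
    mean_x = [sum(row[j] for row in features) / n for j in range(d)]
    mean_y = sum(targets) / n
    xc = [[row[j] - mean_x[j] for j in range(d)] for row in features]
    yc = [t - mean_y for t in targets]
    gram = [[sum(r[i] * r[j] for r in xc) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    moment = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(d)]
    weights = solve(gram, moment)
    intercept = mean_y - sum(w * m for w, m in zip(weights, mean_x))
    return weights, intercept


def multidk_features(bits, props, train_bits):
    """Kernel outputs against every training molecule, then the raw non-binary values."""
    return [tanimoto(bits, t) for t in train_bits] + list(props)


# Hypothetical toy training set: fingerprints, one non-binary property, logS targets.
train_bits = [{0, 1, 2}, {2, 3}, {0, 4}, {1, 3, 4}]
train_props = [[1.0], [2.0], [0.5], [1.5]]
train_log_s = [-1.0, -2.5, -0.5, -2.0]

rows = [multidk_features(b, p, train_bits) for b, p in zip(train_bits, train_props)]
weights, intercept = ridge_fit(rows, train_log_s, alpha=0.1)

# Prediction for an unseen molecule: same feature construction, then a dot product.
test_row = multidk_features({0, 1}, [1.2], train_bits)
prediction = intercept + sum(w * v for w, v in zip(weights, test_row))
```

Because the kernel pre-processing replaces the 4213 binary bits with one similarity value per training molecule, the regression stage itself stays linear, which is what allows a standard Ridge routine to be reused unchanged.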
We use MultiDK for solubility prediction as follows:

$$\log S = \sum_{i=1}^{L} w^{CK}_{i}\, k_B(x^{CK}, x^{CK}_{i}) + w^{WSP} \cdot x^{WSP} + w_0 \qquad (2)$$

where the indices C, K, W, S, and P represent the Morgan circular fingerprint, the MACCS keys, the molecular weight, Labute's approximate surface area (LASA) (Labute 2000) and the logarithm of the partition coefficient (logP), respectively. $L$ is the number of training molecules, $k_B(\cdot)$ is a binary kernel function, $x^{CK} = [x^{C}, x^{K}]$ is the concatenated binary vector of an input molecule, $x^{CK}_{i}$ is the concatenated binary vector of the $i$th supporting molecule, and $x^{WSP} = [x^{W}, x^{S}, x^{P}]$ collects the non-binary descriptors, of which $x^{W}$ is the molecular weight. Both $w^{CK}_{i}$ and $w^{WSP}$ are regression coefficients and $w_0$ is the regression intercept. The values of $x^{C}$, $x^{K}$ and $x^{W}$ are generated from the SMILES string of a molecule.

MultiDK for estimating pH dependent solubility, logS(pH)
In order to predict pH-dependent solubility, we extend the MultiDK method as follows:

$$\log S(\mathrm{pH}) = \log S + \log P - \log D(\mathrm{pH}) \qquad (3)$$

where $\log P$ and $\log D(\mathrm{pH})$ are the n-octanol-to-water partition coefficient and the pH-dependent distribution coefficient, respectively. The two coefficients can be approximated as $\log P = \log S_{\mathrm{Oct}} - \log S$ and $\log D(\mathrm{pH}) = \log S_{\mathrm{Oct}} - \log S(\mathrm{pH})$, where $\log S_{\mathrm{Oct}}$ is the solubility in octanol. Subtracting the second expression from the first eliminates $\log S_{\mathrm{Oct}}$ and gives $\log P - \log D(\mathrm{pH}) = \log S(\mathrm{pH}) - \log S$, which rearranges to Eq. (3). The octanol solubility is intrinsic and therefore determined regardless of the existence of ionizable groups. We evaluate both $\log P$ and $\log D(\mathrm{pH})$ using the cxcalc plugin in the Chemaxon Marvin suite.

Results and Discussion
Cross-validation results
Performance of MultiDK for solubility prediction
We use the r distribution of 20-fold cross-validation as a metric of prediction performance. The r distribution is obtained by repeating both training and testing 20 times, until all 20 subsets of the data have been used for validation. Figure 1 shows the r distribution obtained with each of the methods tested as a function of the Ridge regression hyper-parameter α. Here, we used the 1676 unique molecules. For efficient comparison, only one non-binary descriptor is considered in this evaluation. Both the MultiDK and the MD methods employ two binary descriptors and one non-binary descriptor, namely Morgan fingerprints (MFP), MACCS Keys (MACCS) and molecular weight (MolW). As shown in the figure, MultiDK and MD significantly outperform SD. Moreover, MultiDK is the most robust to changes in the value of α. This result reveals that additional group and property information helps improve the regression performance.

In Figure 2, the performances of the SD family, MD and MultiDK are compared when the optimal value of α is used, where the SD family includes MFP, MACCS and MolW. This bar graph shows a clear difference between the SD family, MD and MultiDK approaches. The best α values are found by a grid search, which selects α on the basis of regression performance over 10 logarithmically equally spaced steps. Each regression performance is evaluated using a 20-fold cross-validation with initial data shuffling. SD (MFP), MD and MultiDK achieve their best regression coefficient values of r ± std(r) = 0.72 ± 0.06, 0.86 ± 0.05 and 0.89 ± 0.04 (Table 2) at α = 10.0, 31.6 and 0.03, respectively. This result highlights three important points. First, SD with MFP outperforms the other two SD approaches, SD using MACCS and SD using MolW. This suggests that detailed structural information helps to estimate solubility. MolW is a single non-binary value, whereas MACCS and MFP consist of 117 and 4096 binary values, respectively.
Second, both MD and MultiDK outperform SD, which emphasizes the necessity of multiple-type descriptors for accurately estimating molecular properties. Third, MultiDK can further improve prediction performance in comparison to MD through the use of a binary kernel regression.

Figure 1: Solubility prediction as a function of the Ridge regression hyperparameter α for the SD, MD and MultiDK cases. For each α on the search grid, a 20-fold cross-validation was applied.

Performance of MultiDK with more descriptors
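The selection of α used throughout these results (a logarithmically spaced grid scored by shuffled k-fold cross-validation) can be sketched as below. Several things here are assumptions rather than details from the paper: the exponent bounds of the grid are placeholders, a one-feature ridge model stands in for MultiDK, the score is the coefficient of determination, 5 folds are used instead of 20 because the toy dataset is small, and all function names and data are hypothetical.

```python
import random


def log_grid(lo_exp, hi_exp, steps):
    """`steps` values equally spaced in log10 between 10**lo_exp and 10**hi_exp."""
    if steps == 1:
        return [10.0 ** lo_exp]
    width = (hi_exp - lo_exp) / (steps - 1)
    return [10.0 ** (lo_exp + i * width) for i in range(steps)]


def k_fold(n, k, seed=0):
    """Shuffle indices 0..n-1 once and deal them into k disjoint validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def ridge_1d(x, y, alpha):
    """One-feature ridge fit with an unpenalized intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    w = sxy / (sxx + alpha)
    return w, my - w * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against targets."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cv_scores(x, y, alpha, folds):
    """Train on all folds but one, score on the held-out fold, for every fold."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        w, b = ridge_1d([x[i] for i in train], [y[i] for i in train], alpha)
        scores.append(r_squared([y[i] for i in fold], [w * x[i] + b for i in fold]))
    return scores


# Hypothetical noisy linear data standing in for descriptor/solubility pairs.
rng = random.Random(1)
x = [i / 10 for i in range(40)]
y = [2.0 * v + 1.0 + rng.gauss(0.0, 0.3) for v in x]

folds = k_fold(len(x), k=5)      # the text uses 20 folds on much larger datasets
alphas = log_grid(-3, 2, 10)     # exponent bounds are placeholders
mean_score = {a: sum(cv_scores(x, y, a, folds)) / len(folds) for a in alphas}
best_alpha = max(mean_score, key=mean_score.get)
```

Keeping the per-fold scores, rather than only their mean, is what makes it possible to report both an average and a standard deviation for each method, as in Table 2.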
The r distributions of different methods on the 1676 molecules using more descriptors are shown in Figure 3, where each box represents the interquartile range of r values, i.e., the difference between the first quartile and the third quartile, and the median is drawn inside the box. The numerical values are shown in Table 2. We include two more non-binary descriptors, which are Labute's approximate surface area (LASA) and the logarithm of the partition coefficient (logP). Particularly for MultiDK, we include a method with separate binary kernels for each binary descriptor. MDxy and MultiDKxy represent a method which embeds x binary and y non-binary descriptors. Figure 4 shows a comparison of the experimental data and the MultiDK results obtained through cross-validation with the best α. We obtained the following cross-validation summary statistics: mean(r) = 0.91, std(r) = 0.027, root mean squared error (RMSE) = 0.61, mean absolute error (MAE) = 0.45, median absolute error (MedAE) = 0.33.

Figure 2: The performances of SD, MD and MultiDK compared when the optimal value of α is used. For SD, molecular weight (MolW), MACCS Keys (MACCS) and Morgan fingerprint (MFP or SD) are independently used as a descriptor.

Figure 3: Prediction performance of different methods on the dataset with 1676 molecules.

Table 2: 20-fold cross-validation performances on the 1676 molecules

Method      Best α   E[r]   std(r)
SD          1E+1     0.72   0.06
MD21        3E+1     0.86   0.05
MD23        3E+1     0.88   0.03
MultiDK10   1E-3     0.80   0.04
MultiDK21   3E-2     0.89   0.04
MultiDK23   1E-1     0.91   0.03

Figure 4: Comparison of the 1676 experimental solubility data and cross-validation results of MultiDK using the optimal value of α.

Performance of MultiDK for other datasets
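The summary statistics reported for the 1676-molecule cross-validation (r, RMSE, mean and median absolute error) can be computed as in the following sketch. It assumes that r denotes the Pearson correlation coefficient between experimental and predicted values; the function names and the four toy data points are hypothetical.

```python
import math
import statistics


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equally long sequences."""
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def error_summary(y_true, y_pred):
    """Return r, RMSE, mean absolute error and median absolute error."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {
        "r": pearson_r(y_true, y_pred),
        "RMSE": math.sqrt(sum(e * e for e in abs_err) / len(abs_err)),
        "MAE": sum(abs_err) / len(abs_err),
        "MedAE": statistics.median(abs_err),
    }


# Hypothetical experimental and predicted logS values.
summary = error_summary([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 5.0])
```

The median absolute error is less sensitive to a few badly predicted molecules than the RMSE, which is why a gap between the two (0.33 versus 0.61 in the text) hints at a small number of large outliers.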
Three more datasets, of 496, 1140 and 3310 molecules, are considered in order to verify the proposed MultiDK method, as shown in Figures 5, 6, and 7, respectively. The average values and standard deviations of r obtained across multiple cross-validation iterations are listed in Table 3. From the figures and the table, we confirm that the performance of MultiDK is better than that of MD for all three new datasets when the same input descriptors are used. Moreover, SD with only MFP is shown to be the worst among all cases, which is consistent with the previous 1676-molecule case.

Application to the prediction of quinone electrolytes
Intrinsic solubility prediction of quinone molecules
Next, we apply the MultiDK method to predict the solubility of a set of quinone molecules, which are useful electrolytes for organic aqueous flow batteries. The intrinsic solubility is predicted with three methods: MultiDK, VCCLAB and EGSE. As shown in Figure 9, regardless of the molecule types or the attached R-groups, all three methods predict the intrinsic solubility (logS) of the molecules to be below zero log-molar. Thus, all molecules have intrinsic solubility less than the solubility target of the aqueous flow battery.

Figure 5: Prediction performance of different methods on the dataset with 496 molecules.

Figure 6: Prediction performance of different methods on the dataset with 1140 molecules.

Figure 7: Prediction performance of different methods on the dataset with 3310 molecules.

Table 3: Performances of solubility prediction for different datasets (columns: Method, followed by Best α, E[r] and std(r) for each of the 496-, 1140- and 3310-molecule datasets).

pH-dependent solubility for single R-group quinones

In Figures 10, 11 and 12, we show the pH-dependent solubility predicted by the extended MultiDK method. We applied the extended method to the three types of quinone family molecules. Figure 10 shows the predicted pH-dependent solubility for five BQ molecules, which are BQ with a sulfonic acid (SO3H), phosphoric acid (PO3H2), carboxylic acid (COOH) or hydroxyl (OH) group, or no R group. The BQ molecules with a sulfonic acid, phosphoric acid and carboxylic acid group are shown to be the most soluble molecules at pH 0, 7, and 14, respectively.

Figure 11 shows the predicted pH-dependent solubility of 13 NQ molecules, which are NQ with one of the same four R-groups as in the BQ case, or no R group.
Figure 12 shows the predicted pH-dependent solubility of 9 AQ molecules, which are AQ with one of the same four R-groups as in the BQ and NQ cases, or no R group. Both the NQ and AQ molecules with a sulfonic acid and a phosphoric acid group are shown to be the most soluble molecules at pH 0 and 14, respectively, while both the NQ and AQ molecules with a hydroxyl group or no R-group are less soluble than the other molecules at pH 7.

Figure 8: Predicted solubility of 27 quinone molecules by three different methods, i.e., MultiDK, VCCLAB and EGSE, for benzoquinone (BQ), naphthoquinone (NQ) and anthraquinone (AQ) with the available unique positions of R-group attachment.

Figure 9: Three sets of predicted solubility values for 27 quinones compared against each other. Solubility values were predicted using the MultiDK, VCCLAB and EGSE methods. The three methods show that the predicted intrinsic solubility values of the 27 quinones are lower than 0 log molar, regardless of the attached functional group. 0 log molar is the general solubility requirement of electrolytes for inexpensive organic aqueous flow battery applications.

Figure 10: Predicted pH-dependent solubility of benzoquinones (BQ) with different functional groups. The legend describes the R groups enumerated with BQ. Depending on pH, the solubility values of the quinones with an R-group vary significantly.

pH-dependent solubility of multiple R-group anthraquinones
We predict the pH-dependent solubility of quinone molecules with multiple R-groups. In particular, anthraquinones with multiple sulfonic acid groups and multiple hydroxyl groups are considered. Figure 14 shows structures of anthraquinones with zero, one, two and three sulfonic acid or hydroxyl groups. Quinone molecules with attached sulfonic acid groups are particularly interesting since they display high solubilities and desirable redox potential values. In particular, 9,10-anthraquinone-2,7-disulphonic acid was chosen as a negative electrolyte and 1,2-dihydrobenzoquinone-3,5-disulfonic acid was selected as a positive electrolyte for the acid quinone flow batteries. The alkaline quinone flow battery uses 2,6-dihydroxy-9,10-anthraquinone (2,6-DHAQ) as a negative electrolyte, and the experimental solubility of 2,6-DHAQ is reported as more than 0.6 M in 1 M KOH.

Figure 15 shows that anthraquinone with no such R-groups is highly insoluble at any pH, while Table 4 lists the solubility at pH 0, 7 and 14 and includes prediction results from Chemaxon Cxcalc with the logS plug-in as well as from the extended MultiDK method. The MultiDK prediction shows that the more sulfonic acid groups are attached, the more soluble the molecule is, i.e., $P\log S_{\mathrm{pH}}(\mathrm{AQTS}) > P\log S_{\mathrm{pH}}(\mathrm{AQDS}) > P\log S_{\mathrm{pH}}(\mathrm{AQS}) \gg P\log S_{\mathrm{pH}}(\mathrm{AQ})$, in all pH conditions including the acid case, and that the more hydroxyl groups are attached, the more soluble the molecule is, i.e., $P\log S_{\mathrm{pH}}(\mathrm{THAQ}) > P\log S_{\mathrm{pH}}(\mathrm{DHAQ}) > P\log S_{\mathrm{pH}}(\mathrm{HAQ}) \gg P\log S_{\mathrm{pH}}(\mathrm{AQ})$, in alkaline conditions.

Figure 11: Predicted pH-dependent solubility of naphthoquinones (NQ) with different functional group substituents. Three unique positions are available to attach functional groups in NQ.

Figure 12: Predicted pH-dependent solubility of anthraquinone (AQ) with different functional group substituents. Two unique positions are available to attach functional groups in AQ.

Figure 13: Heatmap of predicted pH-dependent solubility of all three quinone families with different functional group substituents.
Therefore, it is noteworthy that an efficient prediction method should clearly differentiate between the solubilities of enumerated molecules according to the number of ionic functional groups at every pH point. MultiDK with pH-dependent solubility estimation can be used as a more practical tool than the intrinsic solubility prediction method, especially for the discovery of organic flow battery electrolytes.

Figure 14: Anthraquinone and anthraquinones with mono-, di- and tetra-sulfonic acid or hydroxyl groups. Anthraquinone (AQ), anthraquinonesulfonic acid (AQS), anthraquinone-disulfonic acid (AQDS), anthraquinone-tetrasulfonic acid (AQTS), hydroxyl-anthraquinone (HAQ), dihydroxyl-anthraquinone (DHAQ) and tetrahydroxyl-anthraquinone (THAQ) are illustrated.

Figure 15: Predicted intrinsic and pH-dependent solubility of seven anthraquinone family molecules with sulfonic or hydroxyl groups. Although their intrinsic solubility is predicted to have similar values, their pH-dependent solubility values vary significantly depending on how many and which functional groups are attached.

Table 4: pH-dependent solubility of AQ with multiple R-groups, where sulfonic acid and hydroxyl groups are considered. The pH-dependent solubility values are estimated by MultiDK and Chemaxon Cxcalc.

          MultiDK                Cxcalc
pH        0      7      14      0      7      14
AQ       -4.9   -4.9   -4.9    -4.5   -4.5   -4.5
AQS      -1.5   -1.3   -1.3    -1.6    0      0
AQDS      0.1    0.3    0.3     0      0      0
AQTS      1.6    1.8    1.8     0      0      0
HAQ      -3.9   -3.7   -1.7    -4.1   -3.9    0
DHAQ     -3.4   -3.0    0.9    -3.7   -3.3    0
THAQ     -3.5   -3.1    2.8    -3.3   -2.9    0

Conclusion
Organic aqueous flow battery systems require highly soluble electrolytes, which are two- to five-fold more soluble than pharmaceutical drugs. In order to search for molecules with such a tight solubility requirement, high-throughput screening is a compelling approach, especially when it is combined with an efficient solubility prediction method. Moreover, the investigation of pH-dependent solubility is essential for discovering highly soluble molecules which include an ionizable fragment such as the sulfonic acid (-SO3H), the carboxylic acid (-COOH), the hydroxyl (-OH) and the dihydrogen phosphite (-PO3H2) groups. We have developed a multiple descriptor multiple kernel (MultiDK) approach as an efficient property prediction method. Because the ensemble descriptor consists of structure hash and fragment key fingerprints as well as one or a few property-specific descriptors (molecular weight alone, or additionally Labute's approximate surface area and a partition coefficient), MultiDK has been shown to be capable of fast, accurate and universal solubility prediction. With the extension of MultiDK, the pH-dependent solubility of various quinones, even with strongly acidic or alkaline functional groups, was investigated at each pH point; these quinones are strong candidates for electrolytes in organic aqueous flow batteries.

Acknowledgement

This work was funded by the U.S. DOE ARPA-E award DE-AR0000348. We thank Roy G. Gordon and Michael J. Aziz for helpful discussions. The support of Changwon Suh and Rafael Gómez-Bombarel was useful in this work.
References

(1) Huskinson, B.; Marshak, M. P.; Suh, C.; Er, S.; Gerhardt, M. R.; Galvin, C. J.; Chen, X.; Aspuru-Guzik, A.; Gordon, R. G.; Aziz, M. J. Nature, 195–198.
(2) Yang, B.; Hoober-Burkhardt, L.; Wang, F.; Prakash, G. K. S.; Narayanan, S. R. Journal of The Electrochemical Society, A1371–A1380.
(3) Lin, K.; Chen, Q.; Gerhardt, M. R.; Tong, L.; Kim, S. B.; Eisenach, L.; Valle, A. W.; Hardee, D.; Gordon, R. G.; Aziz, M. J.; Marshak, M. P. Science, 1529–1532.
(4) Liu, T.; Wei, X.; Nie, Z.; Sprenkle, V.; Wang, W. Advanced Energy Materials, 1501449.
(5) Winsberg, J.; Janoschka, T.; Morgenstern, S.; Hagemann, T.; Muench, S.; Hauffman, G.; Gohy, J.-F.; Hager, M. D.; Schubert, U. S. Advanced Materials, 2238–2243.
(6) Soloveichik, G. L. Chemical Reviews, 11533–11558, PMID: 26389560.
(7) Yang, B.; Hoober-Burkhardt, L.; Krishnamoorthy, S.; Murali, A.; Prakash, G. K. S.; Narayanan, S. R. Journal of The Electrochemical Society, A1442–A1449.
(8) Pyzer-Knapp, E. O.; Simm, G. N.; Aspuru-Guzik, A. Materials Horizons, 226–233.
(9) Plessow, P. N.; Bajdich, M.; Greene, J.; Vojvodic, A.; Abild-Pedersen, F. The Journal of Physical Chemistry C, 10351–10360.
(10) Peplow, M. Nature News, 148.
(11) Santos, E. J. G.; Nørskov, J. K.; Vojvodic, A. The Journal of Physical Chemistry C, 17662–17666.
(12) Ma, J.; Sheridan, R. P.; Liaw, A.; Dahl, G. E.; Svetnik, V. Journal of Chemical Information and Modeling, 263–274.
(13) Shu, Y.; Levine, B. G. The Journal of Chemical Physics, 104104.
(14) Hachmann, J.; Olivares-Amaya, R.; Jinich, A.; Appleton, A. L.; Blood-Forsythe, M. A.; Seress, L. R.; Román-Salgado, C.; Trepte, K.; Atahan-Evrenk, S.; Er, S.; Shrestha, S.; Mondal, R.; Sokolov, A.; Bao, Z.; Aspuru-Guzik, A. Energy & Environmental Science, 698–704.
(15) Curtarolo, S.; Hart, G. L. W.; Nardelli, M. B.; Mingo, N.; Sanvito, S.; Levy, O. Nature Materials, 191–201.
(16) Kanal, I. Y.; Owens, S. G.; Bechtel, J. S.; Hutchison, G. R. The Journal of Physical Chemistry Letters, 1613–1623.
(17) Sokolov, A. N.; Atahan-Evrenk, S.; Mondal, R.; Akkerman, H. B.; Sánchez-Carrera, R. S.; Granados-Focil, S.; Schrier, J.; Mannsfeld, S. C. B.; Zoombelt, A. P.; Bao, Z.; Aspuru-Guzik, A. Nature Communications, 437.
(18) Fischer, C. C.; Tibbetts, K. J.; Morgan, D.; Ceder, G. Nature Materials, 641–646.
(19) Shoichet, B. K. Nature, 862–865.
(20) Bajorath, J. Nature Reviews Drug Discovery, 882–894.
(21) Er, S.; Suh, C.; Marshak, M. P.; Aspuru-Guzik, A. Chem. Sci., 885–893.
(22) Pineda Flores, S. D.; Martin-Noble, G. C.; Phillips, R. L.; Schrier, J. The Journal of Physical Chemistry C, 21800–21809.
(23) Wang, J.; Hou, T. Combinatorial Chemistry & High Throughput Screening, 328–338.
(24) Skyner, R. E.; McDonagh, J. L.; Groom, C. R.; Mourik, T. v.; Mitchell, J. B. O. Physical Chemistry Chemical Physics, 6174–6191.
(25) Huuskonen, J. Journal of Chemical Information and Computer Sciences, 773–777.
(26) Bhal, S. K.; Kassam, K.; Peirson, I. G.; Pearl, G. M. Molecular Pharmaceutics, 556–560.
(27) Bergström, C. A. S.; Luthman, K.; Artursson, P. European Journal of Pharmaceutical Sciences, 387–398.
(28) Mitchell, J. B. O. Wiley Interdisciplinary Reviews: Computational Molecular Science, 468–481.
(29) Hughes, L. D.; Palmer, D. S.; Nigsch, F.; Mitchell, J. B. O. Journal of Chemical Information and Modeling, 220–232.
(30) Palmer, D. S.; O'Boyle, N. M.; Glen, R. C.; Mitchell, J. B. O. Journal of Chemical Information and Modeling, 150–158.
(31) McDonagh, J. L.; Nath, N.; De Ferrari, L.; van Mourik, T.; Mitchell, J. B. O. Journal of Chemical Information and Modeling, 844–856.
(32) Marten, B.; Kim, K.; Cortis, C.; Friesner, R. A.; Murphy, R. B.; Ringnalda, M. N.; Sitkoff, D.; Honig, B. The Journal of Physical Chemistry, 11775–11788.
(33) Tannor, D. J.; Marten, B.; Murphy, R.; Friesner, R. A.; Sitkoff, D.; Nicholls, A.; Honig, B.; Ringnalda, M.; Goddard, W. A. Journal of the American Chemical Society, 11875–11882.
(34) Raccuglia, P.; Elbert, K. C.; Adler, P. D. F.; Falk, C.; Wenny, M. B.; Mollo, A.; Zeller, M.; Friedler, S. A.; Schrier, J.; Norquist, A. J. Nature, 73–76.
(35) Silver, D. et al. Nature, 484–489.
(36) Jain, N.; Yalkowsky, S. H. Journal of Pharmaceutical Sciences, 234–252.
(37) Ran, Y.; He, Y.; Yang, G.; Johnson, J. L. H.; Yalkowsky, S. H. Chemosphere, 487–509.
(38) Delaney, J. S. Journal of Chemical Information and Computer Sciences, 1000–1005.
(39) Wang, J.; Hou, T.; Xu, X. Journal of Chemical Information and Modeling, 571–581, PMID: 19226181.
(40) Tetko, I. V.; Bruneau, P. Journal of Pharmaceutical Sciences, 3103–3110.
(41) Tetko, I. V.; Tanchuk, V. Y.; Villa, A. E. P. Journal of Chemical Information and Computer Sciences, 1407–1421.
(42) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Advanced Drug Delivery Reviews, 3–26.
(43) Viswanadhan, V. N.; Ghose, A. K.; Revankar, G. R.; Robins, R. K. Journal of Chemical Information and Computer Sciences, 163–172.
(44) Ali, J.; Camilleri, P.; Brown, M. B.; Hutt, A. J.; Kirton, S. B. Journal of Chemical Information and Modeling, 420–428.
(45) Zhou, D.; Alelyunas, Y.; Liu, R. Journal of Chemical Information and Modeling, 981–987.
(46) Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Journal of Chemical Information and Computer Sciences, 1273–1280.
(47) Klopman, G.; Wang, S.; Balthasar, D. M. Journal of Chemical Information and Computer Sciences, 474–482.
(48) Kühne, R.; Ebert, R. U.; Kleint, F.; Schmidt, G.; Schüürmann, G. Chemosphere, 2061–2077.
(49) Cheng, T.; Li, Q.; Wang, Y.; Bryant, S. H. Journal of Chemical Information and Modeling, 229–236.
(50) Tetko, I. V.; Poda, G. I. Journal of Medicinal Chemistry, 5601–5604.
(51) Xing, L.; Glen, R. C. Journal of Chemical Information and Computer Sciences, 796–805.
(52) Hall, L. H.; Kier, L. B. Journal of Chemical Information and Computer Sciences, 1039–1045.
(53) Rogers, D.; Hahn, M. Journal of Chemical Information and Modeling, 742–754.
(54) Duvenaud, D.
K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R. P. In Advances in Neural Information Processing Systems 28 ;Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., Garnett, R., Eds.; CurranAssociates, Inc., 2015; pp 2224–2232.(55) Lind, P.; Maltseva, T.
Journal of Chemical Information and Computer Sciences , , 1855–1859, PMID: 14632433. 2956) Steinbeck, C.; Hoppe, C.; Kuhn, S.; Floris, M.; Guha, R.; Willighagen, E. L. CurrentPharmaceutical Design , , 2111–2120.(57) Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R.; others, The Annals of statistics , , 407–499.(58) Hou, T. J.; Xia, K.; Zhang, W.; Xu, X. J. Journal of Chemical Information and Com-puter Sciences , , 266–275.(59) Ledwidge, M. T.; Corrigan, O. I. International Journal of Pharmaceutics , ,187–200.(60) Hansen, N. T.; Kouskoumvekaki, I.; Jørgensen, F. S.; Brunak, S.; J´onsd´ottir, S. ´O. Journal of Chemical Information and Modeling , , 2601–2609, 00000.(61) Wang, J.-B.; Cao, D.-S.; Zhu, M.-F.; Yun, Y.-H.; Xiao, N.; Liang, Y.-Z. Journal ofChemometrics , , 389–398, 00000.(62) Pyzer-Knapp, E. O.; Suh, C.; G´omez-Bombarelli, R.; Aguilera-Iparraguirre, J.; Aspuru-Guzik, A. Annual Review of Materials Research , , 195–216.(63) Kearnes, S. M.; Haque, I. S.; Pande, V. S. Journal of chemical information and modeling , , 5–15.(64) Wang, J.; Krudy, G.; Hou, T.; Zhang, W.; Holland, G.; Xu, X. Journal of ChemicalInformation and Modeling , , 1395–1404.(65) Willighagen, E. L.; Denissen, H. M. G. W.; Wehrens, R.; Buydens, L. M. C. Journalof Chemical Information and Modeling , , 487–494, PMID: 16562976.(66) McKinney, W. Data Structures for Statistical Computing in Python. Proceedings ofthe 9th Python in Science Conference. 2010; pp 51 – 56.(67) Pedregosa, F. et al. Journal of Machine Learning Research , , 2825–2830.3068) Abadi, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.2015; http://tensorflow.org/ , Software available from tensorflow.org.(69) Waskom, M. et al. seaborn: v0.7.0 (January 2016). 2016; http://dx.doi.org/10.5281/zenodo.45133 .(70) G¨onen, M.; Alpaydin, E. The Journal of Machine Learning Research , , 2211–2268.(71) Bach, F. R.; Lanckriet, G. R. G.; Jordan, M. I. Multiple Kernel Learning, Conic Duality,and the SMO Algorithm. 
Proceedings of the Twenty-first International Conference onMachine Learning. 2004.(72) Lanckriet, G. R.; Cristianini, N.; Bartlett, P.; Ghaoui, L. E.; Jordan, M. I. The Journalof Machine Learning Research , , 27–72.(73) Yu, S.; Falck, T.; Daemen, A.; Tranchevent, L.-C.; Suykens, J. A.; Moor, B. D.;Moreau, Y. BMC Bioinformatics , , 309.(74) Chen, L.; Duan, L.; Xu, D. Event Recognition in Videos by Learning from Heteroge-neous Web Sources. 2013 IEEE Conference on Computer Vision and Pattern Recogni-tion (CVPR). 2013; pp 2666–2673.(75) Xu, X.; Tsang, I. W.; Xu, D. IEEE transactions on neural networks and learningsystems , , 749–761.(76) Pekalska, E.; Paclik, P.; Duin, R. P. W. J. Machine Learning Res. , 175–211.(77) Kew, W.; Mitchell, J. B. O.
Molecular Informatics , , 634–647.(78) RDKit: Open-source cheminformatics. EnvironmentalToxicology and Chemistry , , 1109–1117.(82) Chemicalize.org was used for name to structure generation/prediction of xyz proper-ties/etc, Chemaxon. chemicalize.org , 2015; Accessed: 2015-10-06.(83) Labute, P. Journal of Molecular Graphics and Modelling , , 464–477.(84) Tetko, I. V.; Tanchuk, V. Y.; Kasheva, T. N.; Villa, A. E. P. Journal of ChemicalInformation and Computer Sciences , , 1488–1493, 00244.32 upplementary Information MultiDK vs. SVR and DNN
The performance of support vector regression (SVR) and a deep neural network (DNN) is tested for solubility estimation. Both methods use the same descriptors as MultiDK23. We evaluate SVR and DNN using the Scikit-learn and TensorFlow packages in Python, respectively.

For SVR, we choose the radial basis function (RBF) as the kernel, which is given by

$$k_{\mathrm{RBF}}(\mathbf{x}, \mathbf{y}) = e^{-\gamma \lVert \mathbf{x} - \mathbf{y} \rVert^{2}} \tag{5}$$

The penalty hyperparameter C is searched over seven logarithmically equally spaced points from 1E-3 to 1E+3, while the other hyperparameters, $\epsilon$ (which specifies the epsilon-tube) and $\gamma$, are kept at the default values provided by the Scikit-learn package. For the 3310-molecule data set, the average r values of MultiDK and RBF-SVR are 0.87 and 0.83, respectively (Table 5).

Table 5: Average and standard deviation (std) of the best r values of SVR and MultiDK for each data set.

| Data set | SVR: best C | SVR: E[r] | SVR: std(r) | MultiDK: best α | MultiDK: E[r] | MultiDK: std(r) |
|---|---|---|---|---|---|---|
| 1676 molecules | 1E+2 | 0.88 | 0.02 | 1E-1 | 0.91 | 0.03 |
| 496 molecules | 1E+2 | 0.87 | 0.04 | 7E-2 | 0.89 | 0.05 |
| 1140 molecules | 1E+2 | 0.90 | 0.01 | 3E-2 | 0.92 | 0.02 |
| 3310 molecules | 1E+1 | 0.83 | 0.01 | 1E-1 | 0.87 | 0.04 |

Figure 16: The r distributions of SVR with respect to the hyperparameter C.

For DNN, we evaluate the largest data set, which includes 3310 molecules. Of these, 20% are held out for external testing, and 20% of the remaining molecules are used for internal validation. We manually tried many different network architectures and found that a DNN with three hidden layers of 100, 50, and 10 units (first, second, and third hidden layer, respectively) performs best among all the structures we tested. The performance of the best DNN on the test molecules is r = 0.84, RMSE = 0.86, MAE = 0.60, and DAE = 0.42, where DAE denotes the median absolute error. This r is lower than the average r of MultiDK23, even though the DNN uses the same descriptors as MultiDK23.

Kernels for a binary descriptor
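To make the comparison in this section concrete, the following minimal Python sketch evaluates a linear kernel and a Tanimoto similarity kernel on toy binary vectors. It assumes NumPy. The function names, the toy vectors, and the convention of returning 1.0 for two all-zero vectors are illustrative choices and are not taken from the MultiDK implementation. The point it illustrates is that, with the number of shared 1's held fixed, the linear kernel is unchanged when extra non-shared 1's are added, whereas the Tanimoto value drops.

```python
import numpy as np


def linear_kernel(x, y):
    """Inner product; for 0/1 vectors this counts the 1's shared by x and y."""
    return int(np.dot(np.asarray(x), np.asarray(y)))


def tanimoto_kernel(x, y):
    """Shared 1's divided by the 1's present in at least one of the vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    shared = int(np.sum(x & y))   # number of common 1's
    either = int(np.sum(x | y))   # number of positions with a 1 in x or y
    if either == 0:
        # Assumption for this sketch: two all-zero vectors count as identical.
        return 1.0
    return shared / either


# Hypothetical toy fingerprints (not from the paper's data sets).
x = np.array([1, 1, 1, 0, 0, 0])
y_same = np.array([1, 1, 1, 0, 0, 0])    # 3 shared 1's, 0 non-shared 1's
y_extra = np.array([1, 1, 1, 1, 1, 1])   # 3 shared 1's, 3 non-shared 1's

k_lin_same = linear_kernel(x, y_same)
k_lin_extra = linear_kernel(x, y_extra)
k_tan_same = tanimoto_kernel(x, y_same)
k_tan_extra = tanimoto_kernel(x, y_extra)

print("linear:  ", k_lin_same, k_lin_extra)
print("Tanimoto:", k_tan_same, k_tan_extra)
```

In this toy case both linear kernel values equal the shared count, so the linear kernel cannot distinguish the identical pair from the pair with three extra 1's; the Tanimoto kernel can.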
The Tanimoto similarity has been used as a kernel function to exploit binary feature information, for example in recognizing white images on a black background. For further understanding, we compare the Tanimoto similarity kernel with the linear kernel. The linear kernel is given by

$$k_L(\mathbf{x}, \mathbf{x}^a_i) = \mathbf{x}^T \mathbf{x}^a_i = s \tag{6}$$

and the Tanimoto similarity kernel is given by

$$k_T(\mathbf{x}, \mathbf{x}^a_i) = \frac{f_\wedge(\mathbf{x}, \mathbf{x}^a_i)}{f_\vee(\mathbf{x}, \mathbf{x}^a_i)} = \frac{s}{s + d} = \frac{1}{1 + d/s} \tag{7}$$

where $s = \mathbf{x}^T \mathbf{x}^a_i = f_\wedge(\mathbf{x}, \mathbf{x}^a_i) = \sum_j x_j \wedge x^a_{i,j}$ is the number of 1's common to the two vectors, $f_\vee(\mathbf{x}, \mathbf{x}^a_i) = \sum_j x_j \vee x^a_{i,j}$ is the number of positions at which at least one of the two vectors has a 1, and $d = f_\vee(\mathbf{x}, \mathbf{x}^a_i) - s$.

Figure 17: Comparison of the experimental solubility and the solubility predicted by the DNN for the test molecules.

The linear kernel $k_L(\mathbf{x}, \mathbf{x}^a_i)$ does not depend on $d$, whereas $k_T(\mathbf{x}, \mathbf{x}^a_i)$ decreases as $d$ increases (it is inversely proportional to $1 + d/s$), which resembles the behavior of the radial basis function. Therefore, a kernel regression with $k_T(\mathbf{x}, \mathbf{x}^a_i)$