Sabina Smusz
Polish Academy of Sciences
Publications
Featured research published by Sabina Smusz.
Journal of Cheminformatics | 2014
Rafał Kurczab; Sabina Smusz; Andrzej J. Bojarski
Background: The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods.
Results: The impact of this rather neglected aspect of applying machine learning methods was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of the dynamics of those variations let us recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set.
Conclusions: The ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
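The evaluation parameters named in the abstract above (precision, hit recall, and MCC) can all be derived from one confusion matrix. The following is a minimal sketch, not the authors' code; the function name `screening_metrics` and the example counts are hypothetical and serve only to illustrate the standard definitions.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return (precision, recall, MCC) for one confusion matrix.

    tp/fn count actives predicted active/inactive;
    tn/fp count inactives predicted inactive/active.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Hypothetical counts, for illustration only (not data from the paper).
precision, recall, mcc = screening_metrics(tp=40, fp=10, tn=90, fn=10)
print(f"precision={precision:.2f} recall={recall:.2f} MCC={mcc:.2f}")
```

Computing all three values for each training-set composition makes it possible to see a trade-off between precision and recall rather than relying on a single score.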
Journal of Cheminformatics | 2013
Sabina Smusz; Rafał Kurczab; Andrzej J. Bojarski
Background: Machine learning methods have become increasingly popular in virtual screening over the past few years, in both classification and regression tasks. However, their effectiveness is strongly dependent on many different factors.
Results: In this study, the influence of the way the set of inactives is formed on the classification process was examined: random and diverse selection from the ZINC database, the MDDR database, and libraries generated according to the DUD methodology. All learning methods were tested in two modes: using one test set, the same for each method of generating inactive molecules, and using test sets with inactives prepared in the same way as for training. The experiments were carried out for 5 different protein targets, 3 fingerprints for molecule representation and 7 classification algorithms with varying parameters. The process of inactive set formation had a substantial impact on the performance of the machine learning methods.
Conclusions: The level of chemical space limitation determined the ability of the tested classifiers to select potentially active molecules in virtual screening tasks; for example, DUDs (widely applied in docking experiments) did not provide proper selection of active molecules from databases with diverse structures. The study clearly showed that the inactive compounds forming the training set should be as representative as possible of the libraries that undergo screening.
Bioorganic & Medicinal Chemistry Letters | 2014
Jagna Witek; Sabina Smusz; Krzysztof Rataj; Stefan Mordalski; Andrzej J. Bojarski
In this Letter, we present a novel methodology for searching for biologically active compounds, based on the combination of docking experiments and analysis of the results by machine learning methods. The study was performed for 5 different protein kinases, and several sets of compounds (active, inactive and assumed inactives) were docked into their targets. The resulting ligand-protein complexes were represented by means of structural interaction fingerprint profiles (SIFt profiles), which constituted the input for ML methods. The developed protocol was found to be superior to the combination of classification algorithms with the standard fingerprint MACCSFP.
Bioorganic & Medicinal Chemistry Letters | 2015
Sabina Smusz; Rafał Kurczab; Grzegorz Satała; Andrzej J. Bojarski
Virtual screening for new 5-HT6R ligands was carried out with three different fingerprints used for molecule representation. Two structurally new compounds were found to have significant 5-HT6R activity (Ki of 119 and 670 nM). The compounds do not possess a positive ionizable group in their structures and therefore belong to the group of atypical, non-basic 5-HT6R ligands. Docking and molecular dynamics simulation experiments showed that the obtained hits fit well in the 5-HT6R binding cavity. Moreover, an in silico evaluation of the ADMET properties of these compounds predicted their drug-like character.
Journal of Chemical Information and Modeling | 2015
Sabina Smusz; Stefan Mordalski; Jagna Witek; Krzysztof Rataj; Rafał Kafel; Andrzej J. Bojarski
Molecular docking, despite its undeniable usefulness in computer-aided drug design protocols and the increasing sophistication of tools used in the prediction of ligand-protein interaction energies, still poses the problem of analyzing its results effectively. In this study, a novel protocol for the automatic evaluation of numerous docking results is presented, combining Structural Interaction Fingerprints and Spectrophores descriptors, machine-learning techniques, and multi-step results analysis. Such an approach takes into consideration the performance of a particular learning algorithm (five machine learning methods were applied), the performance of the docking algorithm itself, the variety of conformations returned from the docking experiment, and the receptor structure (homology models were constructed on five different templates). Evaluation using compounds active toward 5-HT6 and 5-HT7 receptors, as well as additional analysis carried out for beta-2 adrenergic receptor ligands, proved that the methodology is a viable tool for supporting virtual screening protocols, enabling proper discrimination between active and inactive compounds.
Journal of Cheminformatics | 2011
Rafał Kurczab; Sabina Smusz; Andrzej J. Bojarski
In silico High Throughput Screening of large compound databases has become an increasingly popular technology for finding valuable drug candidates by applying a wide range of computational methods, such as machine learning [1]. In recent years, many comparative studies of the performance of different machine learning methods in ligand-based virtual screening have been reported [2,3].
In order to extend these studies, we have evaluated over 60 different machine learning methods, such as support vector machines (with and without parameter optimization), naive Bayesian, decision trees, random forest, meta-classifiers (boosting, bagging, grading) and many others. All calculations were performed using a collection of machine learning algorithms for data mining implemented in the WEKA package [4]. Additionally, for each method, we have examined the influence of different types of fingerprints, the size of training sets and attribute selection methods on the rate of active recall and the precision of selection. Our internal database of known 5-HT7 antagonists was used to build training and testing sets.
It was found that no machine learning approach consistently provides the best results, but some of them are very stable and can be applied universally.
Journal of Cheminformatics | 2015
Stefan Mordalski; Jagna Witek; Sabina Smusz; Krzysztof Rataj; Andrzej J. Bojarski
Background: Distinguishing active from inactive compounds is one of the crucial problems of molecular docking, especially in the context of virtual screening experiments. The randomization of poses and the natural flexibility of the protein make this discrimination even harder. Some of the recent approaches to post-docking analysis use an ensemble of receptor models to mimic this naturally occurring conformational diversity. However, the optimal number of receptor conformations is yet to be determined. In this study, we compare the results of a retrospective screening of beta-2 adrenergic receptor ligands performed on both the ensemble of receptor conformations extracted from ten available crystal structures and an equal number of homology models. Additional analysis was also performed for homology models with up to 20 receptor conformations considered.
Results: The docking results were encoded into Structural Interaction Fingerprints and were automatically analyzed by a support vector machine. The use of homology models in such a virtual screening application proved superior to the use of crystal structures. Additionally, increasing the number of receptor conformational states led to enhanced effectiveness of active vs. inactive compound discrimination.
Conclusions: For virtual screening purposes, the use of homology models was found to be most beneficial, even in the presence of crystallographic data regarding the conformational space of the receptor. The results also showed that increasing the number of receptors considered improves the effectiveness of identifying active compounds by machine learning methods.
Graphical abstract: Comparison of machine learning results obtained for various numbers of beta-2 AR homology models and crystal structures.
Journal of Cheminformatics | 2013
Rafał Kurczab; Sabina Smusz; Andrzej J. Bojarski
In drug discovery, machine learning is widely used to classify molecules as active or inactive against a particular target. The vast majority of these methods (supervised learning) need a training set of objects (molecules) to develop a decision rule that can be used to classify new entities (the test set) into one of the two mentioned classes [1]. Many studies searching for optimal learning parameters and examining their impact on classification effectiveness have been performed [2,3]. Unfortunately, there are no data showing the influence of the actives/inactives ratio used in model training on the efficiency of identifying new active compounds. Therefore, the main goal of this study was to examine the impact of changing the number of inactives in the training set with a fixed number of actives. For a given ratio, the inactives were randomly selected from the ZINC database (10 times, to prevent overestimation error). This concept was verified on three different protein targets (i.e. 5-HT1A, HIV-1 protease and matrix metalloproteinase) and a set of algorithms (SMO, Naive Bayes, IBk, J48 and Random Forest) implemented in the WEKA package [4]. For compound representation, two types of molecular fingerprints were used (MACCS and a hashed fingerprint) to determine their possible impact on machine learning performance.
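The sampling scheme described above (a fixed set of actives, a chosen number of inactives per active, and ten independent random draws) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' procedure: the function name `sample_training_sets`, its parameters, and the placeholder compound identifiers are all hypothetical.

```python
import random


def sample_training_sets(actives, inactive_pool, inactives_per_active,
                         repeats=10, seed=0):
    """Build `repeats` training sets that share the same actives.

    Each set pairs the full list of actives with a fresh random draw
    (without replacement) of inactives from `inactive_pool`.
    """
    rng = random.Random(seed)  # seeded so the draws are reproducible
    n_inactives = int(len(actives) * inactives_per_active)
    if n_inactives > len(inactive_pool):
        raise ValueError("inactive pool is too small for the requested ratio")
    training_sets = []
    for _ in range(repeats):
        inactives = rng.sample(inactive_pool, n_inactives)
        training_sets.append((list(actives), inactives))
    return training_sets


# Placeholder identifiers stand in for real compounds.
actives = ["active_1", "active_2"]
pool = [f"zinc_like_{i}" for i in range(20)]
sets = sample_training_sets(actives, pool, inactives_per_active=3)
print(len(sets), len(sets[0][1]))
```

Averaging a model's scores over the repeated draws reduces the chance that one unusually easy or hard selection of inactives distorts the result for a given ratio.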
Journal of Cheminformatics | 2013
Jagna Witek; Krzysztof Rataj; Stefan Mordalski; Sabina Smusz; Tomasz Kosciolek; Andrzej J. Bojarski
Cheminformatic methods, such as Virtual Screening, constitute a vital part of the modern drug design process. This technique enables not only viable prediction of the physicochemical properties of molecules, but also effective database mining, being a particularly useful tool in the search for ligands of desired activity. Successful performance in the case of single-target drugs implies a potential to extend its capabilities to compounds bearing desired activity towards multiple receptors.
In this research, we present the application of Structural Interaction Fingerprints (SIFts) [1] combined with Machine Learning (ML) as a method to select single- and multi-target ligands from the docking results. A handful of protein kinase pairs were designated as targets. A collection of pseudo-selective compounds with various activity profiles was acquired from the ChEMBL database. Furthermore, for each target a set of ligands of various activity was aggregated. Decoy structures were random ligands from the ZINC database. The compounds were docked into the respective proteins, and SIFts were calculated for each protein-ligand complex. Training sets used in ML experiments consisted of cluster centroids of active and inactive ligands, whereas test sets were composed of the remaining compounds that returned docking poses.
The key aim of this study is to develop a viable method to filter the docking results, so that the compounds meeting the desired activity profile are selected.
Journal of Cheminformatics | 2013
Sabina Smusz; Rafał Kurczab; Andrzej J. Bojarski
Computational techniques have become a vital part of today’s drug discovery campaigns. Machine learning methods are among the wide range of tools applied in this process. They are used, for instance, in virtual screening (VS), where their role is to identify potentially active compounds in large libraries of structures [1]. In order to enable the application of various learning algorithms in VS tasks, an appropriate representation of molecules is needed. One solution is the hashed fingerprint, which encodes information about the structure in the form of a bit string [2]. Both length and density (the percentage of 1’s) can be modified during hashed fingerprint generation, which (as has already been shown) influences the similarity searching process [3]. The aim of our study was to examine the impact of such fingerprint density on the performance of machine learning methods. A series of bit strings with different density values and of various lengths was generated by means of the RDKit software [4]. They were tested in classification tests of 5-HT1A ligands, with the use of a set of algorithms (Naïve Bayes, SMO, IBk, Decorate, HyperPipes, J48 and Random Forest), in order to determine optimal values of these variables for machine learning experiments.
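Fingerprint density, as defined above, is the fraction of bits set to 1. A minimal sketch of that calculation in plain Python is shown below; the function name `bit_density` and the toy bit strings are hypothetical, and real fingerprints would come from a cheminformatics toolkit rather than hand-written lists.

```python
def bit_density(bits):
    """Return the fraction of positions set to 1 in a fingerprint.

    `bits` is a sequence of 0/1 integers, one entry per bit position.
    """
    if not bits:
        raise ValueError("fingerprint must contain at least one bit")
    return sum(bits) / len(bits)


# Toy bit strings of different lengths, for illustration only.
short_fp = [1, 0, 0, 1]
long_fp = [1, 0, 0, 1, 0, 0, 0, 0]
print(bit_density(short_fp), bit_density(long_fp))
```

The two toy strings have the same number of set bits but different lengths, so their densities differ. This is why length and density are treated as separate variables.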