Igor V. Tetko
University of Lausanne
Publication
Featured research published by Igor V. Tetko.
Journal of Computer-aided Molecular Design | 2005
Igor V. Tetko; Johann Gasteiger; Roberto Todeschini; A. Mauri; David J. Livingstone; Peter Ertl; V. A. Palyulin; E. V. Radchenko; Nikolai S. Zefirov; Alexander Makarenko; Vsevolod Yu. Tanchuk; Volodymyr V. Prokopenko
Internet technology offers an excellent opportunity for the development of tools by the cooperative effort of various groups and institutions. We have developed a multi-platform software system, Virtual Computational Chemistry Laboratory, http://www.vcclab.org, allowing the computational chemist to perform a comprehensive series of molecular indices/properties calculations and data analysis. The implemented software is based on a three-tier architecture, which is one of the standard technologies for providing client-server services on the Internet. The developed software includes several popular programs, including the indices generation program DRAGON; a 3D structure generator, CORINA; a program to predict lipophilicity and aqueous solubility of chemicals, ALOGPS; and others. All these programs run at host institutes located in five European countries. In this article we review the main features and statistics of the developed system, which can be used as a prototype for academic and industry models.
Science | 2007
Christina A. Cuomo; Ulrich Güldener; Jin-Rong Xu; Frances Trail; B. Gillian Turgeon; Antonio Di Pietro; Jonathan D. Walton; Li-Jun Ma; Scott E. Baker; Martijn Rep; Gerhard Adam; John Antoniw; Thomas K. Baldwin; Sarah E. Calvo; Yueh Long Chang; David DeCaprio; Liane R. Gale; Sante Gnerre; Rubella S. Goswami; Kim E. Hammond-Kosack; Linda J. Harris; Karen Hilburn; John C. Kennell; Scott Kroken; Jon K. Magnuson; Gertrud Mannhaupt; Evan Mauceli; Hans W. Mewes; Rudolf Mitterbauer; Gary J. Muehlbauer
We sequenced and annotated the genome of the filamentous fungus Fusarium graminearum, a major pathogen of cultivated cereals. Very few repetitive sequences were detected, and the process of repeat-induced point mutation, in which duplicated sequences are subject to extensive mutation, may partially account for the reduced repeat content and apparent low number of paralogous (ancestrally duplicated) genes. A second strain of F. graminearum contained more than 10,000 single-nucleotide polymorphisms, which were frequently located near telomeres and within other discrete chromosomal segments. Many highly polymorphic regions contained sets of genes implicated in plant-fungus interactions and were unusually divergent, with higher rates of recombination. These regions of genome innovation may result from selection due to interactions of F. graminearum with its plant hosts.
Journal of Chemical Information and Computer Sciences | 1995
Igor V. Tetko; David J. Livingstone; A. I. Luik
The application of feed-forward back-propagation artificial neural networks (ANNs) with one hidden layer to perform the equivalent of multiple linear regression (MLR) has been examined using artificial structured data sets and real literature data. The predictive ability of the networks has been estimated using a training/test set protocol. The results have shown advantages of ANN over MLR analysis. The ANNs do not require high-order terms or indicator variables to establish complex structure-activity relationships. Overfitting does not have any influence on network prediction ability when overtraining is avoided by cross-validation. Application of ANN ensembles has allowed the avoidance of chance correlations, and satisfactory predictions of new data have been obtained for a wide range of numbers of neurons in the hidden layer.
Journal of Chemical Information and Computer Sciences | 2002
Igor V. Tetko; Vsevolod Yu. Tanchuk
This article provides a systematic study of several important parameters of the Associative Neural Network (ASNN), such as the number of networks in the ensemble, distance measures, neighbor functions, selection of smoothing parameters, and strategies for the user-training feature of the algorithm. The performance of the different methods is assessed with several training/test sets used to predict lipophilicity of chemical compounds. The Spearman rank-order correlation coefficient and Parzen-window regression methods provide the best performance of the algorithm. If additional user data are available, the prediction of lipophilicity of chemicals can be improved by a factor of up to 2-5 when the appropriate smoothing parameters for the neural network are selected. The detected best combinations of parameters and strategies are implemented in the ALOGPS 2.1 program that is publicly available at http://www.vcclab.org/lab/alogps.
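The associative correction studied in this abstract can be illustrated with a small sketch. It is not the ALOGPS implementation: it assumes the simplest k-nearest-neighbor variant, in which each compound is represented by the vector of its ensemble outputs, similarity is the Spearman rank-order correlation between such vectors, and the ensemble mean is shifted by the mean residual of the most similar training compounds. The function names, the value of `k`, and the tie-free ranking are all illustrative assumptions; the Parzen-window variant mentioned above is not shown.

```python
import numpy as np


def ranks(values):
    """Rank positions along the last axis (ties are not averaged; assumption)."""
    return np.argsort(np.argsort(values, axis=-1), axis=-1).astype(float)


def spearman_to_rows(query, rows):
    """Spearman rank correlation between one vector and every row of a matrix."""
    q = ranks(query)
    r = ranks(rows)
    q = q - q.mean()
    r = r - r.mean(axis=1, keepdims=True)
    return (r @ q) / (np.linalg.norm(r, axis=1) * np.linalg.norm(q))


def asnn_predict(query_outputs, train_outputs, train_targets, k=3):
    """Ensemble mean corrected by the mean residual of the k most similar cases.

    query_outputs: outputs of each ensemble member for the new compound
    train_outputs: (n_train, n_networks) ensemble outputs for training compounds
    train_targets: (n_train,) experimental values for training compounds
    """
    ensemble_mean = query_outputs.mean()
    similarity = spearman_to_rows(query_outputs, train_outputs)
    nearest = np.argsort(-similarity)[:k]
    residuals = train_targets[nearest] - train_outputs[nearest].mean(axis=1)
    return ensemble_mean + residuals.mean()


train_outputs = np.array([[1.0, 2.0, 3.0, 4.0],
                          [4.0, 3.0, 2.0, 1.0],
                          [2.0, 1.0, 4.0, 3.0]])
train_targets = np.array([3.0, 2.0, 2.5])
query = np.array([1.1, 2.1, 3.1, 4.1])
corrected = asnn_predict(query, train_outputs, train_targets, k=1)
```

Adding user data in this scheme only extends `train_outputs` and `train_targets`; the networks themselves are not retrained, which is one way to read the "user-training feature" described above.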
Journal of Pharmaceutical Sciences | 2009
Raimund Mannhold; Gennadiy I. Poda; Claude Ostermann; Igor V. Tetko
We first review the state of the art in the development of log P prediction approaches, which fall into two major categories: substructure-based and property-based methods. Then, we compare the predictive power of representative methods for one public (N = 266) and two in-house datasets from Nycomed (N = 882) and Pfizer (N = 95809). A total of 30 and 18 methods were tested for public and industrial datasets, respectively. Accuracy of models declined with the number of nonhydrogen atoms. The Arithmetic Average Model (AAM), which predicts the same value (the arithmetic mean) for all compounds, was used as a baseline model for comparison. Methods with a Root Mean Squared Error (RMSE) greater than the RMSE produced by the AAM were considered unacceptable. The majority of analyzed methods produced reasonable results for the public dataset, but only seven methods were successful on both in-house datasets. We proposed a simple equation based on the number of carbon atoms, NC, and the number of hetero atoms, NHET: log P = 1.46 (±0.02) + 0.11 (±0.001) NC - 0.11 (±0.001) NHET. This equation outperformed a large number of programs benchmarked in this study. Factors influencing the accuracy of log P predictions were elucidated and discussed.
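The simple equation and the AAM acceptance rule from this abstract can be written out as a short sketch. The coefficients are the ones quoted above (uncertainties dropped); the function names are illustrative, and computing the AAM mean on the same set that is being evaluated is an assumption, since the abstract does not say which set the mean is taken from.

```python
import math


def simple_logp(n_carbon, n_hetero):
    """log P = 1.46 + 0.11 * NC - 0.11 * NHET, as quoted in the abstract."""
    return 1.46 + 0.11 * n_carbon - 0.11 * n_hetero


def rmse(predicted, observed):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def aam_rmse(observed):
    """RMSE of the Arithmetic Average Model: one mean value for all compounds.

    Assumption: the mean is computed on the evaluated set itself.
    """
    mean = sum(observed) / len(observed)
    return rmse([mean] * len(observed), observed)


def is_acceptable(predicted, observed):
    """A method is unacceptable when its RMSE exceeds the AAM baseline RMSE."""
    return rmse(predicted, observed) <= aam_rmse(observed)


# A molecule with 6 carbon atoms and no hetero atoms.
estimate = simple_logp(6, 0)
```

Because the equation needs only two atom counts, it is a convenient sanity check next to the AAM when benchmarking any other log P method.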
Computational Biology and Chemistry | 2005
Yu Wang; Igor V. Tetko; Mark A. Hall; Eibe Frank; Axel Facius; Klaus F. X. Mayer; Hans-Werner Mewes
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a large number of gene expression levels as features. To select relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.
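To make the idea of a correlation-based feature selector concrete, the sketch below scores a feature subset with the merit heuristic commonly associated with that approach: k * mean(feature-class correlation) / sqrt(k + k(k-1) * mean(feature-feature correlation)). Using the absolute Pearson correlation as the correlation measure is an assumption made for brevity, and the function name and toy data are hypothetical; this is not the configuration used in the study.

```python
from itertools import combinations

import numpy as np


def cfs_merit(features, labels, subset):
    """Merit of a feature subset: high class correlation, low redundancy.

    features: (n_samples, n_features) array of expression levels
    labels:   (n_samples,) numeric class labels
    subset:   list of column indices to score
    """
    k = len(subset)
    class_corr = np.mean(
        [abs(np.corrcoef(features[:, j], labels)[0, 1]) for j in subset]
    )
    if k == 1:
        return class_corr
    feature_corr = np.mean(
        [abs(np.corrcoef(features[:, a], features[:, b])[0, 1])
         for a, b in combinations(subset, 2)]
    )
    return k * class_corr / np.sqrt(k + k * (k - 1) * feature_corr)


# Toy data: column 0 tracks the class, column 1 is unrelated to it.
labels = np.array([0.0, 0.0, 1.0, 1.0])
features = np.array([[0.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 0.0],
                     [1.0, 1.0]])
merit_single = cfs_merit(features, labels, [0])
merit_pair = cfs_merit(features, labels, [0, 1])
```

Adding the uninformative column lowers the merit, which is the behavior that lets a search over subsets discard irrelevant genes.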
Journal of Chemical Information and Computer Sciences | 2001
Igor V. Tetko; Vsevolod Yu. Tanchuk; Alessandro E. P. Villa
A new method, ALOGPS v 2.0 (http://www.lnh.unil.ch/~itetko/logp/), for the assessment of the n-octanol/water partition coefficient, log P, was developed on the basis of neural network ensemble analysis of 12 908 organic compounds available from the PHYSPROP database of Syracuse Research Corporation. The atom and bond-type E-state indices as well as the number of hydrogen and non-hydrogen atoms were used to represent the molecular structures. A preliminary selection of indices was performed by multiple linear regression analysis, and 75 input parameters were chosen. Some of the parameters combined several atom-type or bond-type indices with similar physicochemical properties. The neural network ensemble training was performed by an efficient partition algorithm developed by the authors. The ensemble contained 50 neural networks, and each neural network had 10 neurons in one hidden layer. The prediction ability of the developed approach was estimated using both the leave-one-out (LOO) technique and a training/test protocol. In the case of interseries predictions, i.e., when molecules in the test and in the training subsets were selected by chance from the same set of compounds, both approaches provided similar results. ALOGPS performance was significantly better than the results obtained by other tested methods. For a subset of 12 777 molecules the LOO results, namely correlation coefficient r² = 0.95, root mean squared error, RMSE = 0.39, and absolute mean error, MAE = 0.29, were calculated. For two cross-series predictions, i.e., when molecules in the training and in the test sets belong to different series of compounds, all analyzed methods performed less efficiently. The decrease in performance could be explained by a different diversity of molecules in the training and in the test sets. However, even for such difficult cases the ALOGPS method provided better prediction ability than the other tested methods. We have shown that the diversity of the training sets rather than the design of the methods is the main factor determining their prediction ability for new data. A comparative performance of the methods as well as a dependence on the number of non-hydrogen atoms in a molecule is also presented.
Journal of Chemical Information and Computer Sciences | 2001
Igor V. Tetko; Vsevolod Yu. Tanchuk; Tamara N. Kasheva; Alessandro E. P. Villa
The molecular weight and electrotopological E-state indices were used to estimate aqueous solubility with artificial neural networks for a diverse set of 1291 organic compounds. The neural network with 33-4-1 neurons provided highly predictive results with r² = 0.91 and RMS = 0.62. The parameters used included several combinations of E-state indices with similar properties. The calculated results were similar to those published for these data by Huuskonen (2000). However, in the current study only E-state indices were used, without the need for the additional indices (the molecular connectivity, shape, flexibility and indicator indices) also considered in the previous study. In addition, the present neural network contained three times fewer hidden neurons. Smaller neural networks and the use of one homogeneous set of parameters provide a more robust model for prediction of aqueous solubility of chemical compounds. Limitations of the developed method for prediction of large compounds are discussed. The developed approach is available online at http://www.lnh.unil.ch/~itetko/logp.
Drug Discovery Today | 2005
Igor V. Tetko
The development of on-line software tools is changing the way we traditionally perform our analysis in drug design, but will chemoinformatics be forever behind bioinformatics in this development?
Journal of Chemical Information and Modeling | 2008
Hao Zhu; Alexander Tropsha; Denis Fourches; Alexandre Varnek; Ester Papa; Paola Gramatica; Tomas Öberg; Phuong Dao; Artem Cherkasov; Igor V. Tetko
Selecting the most rigorous quantitative structure-activity relationship (QSAR) approaches is of great importance in the development of robust and predictive models of chemical toxicity. To address this issue in a systematic way, we have formed an international virtual collaboratory consisting of six independent groups with shared interests in computational chemical toxicology. We have compiled an aqueous toxicity data set containing 983 unique compounds tested in the same laboratory over a decade against Tetrahymena pyriformis. A modeling set including 644 compounds was selected randomly from the original set and distributed to all groups that used their own QSAR tools for model development. The remaining 339 compounds in the original set (external set I) as well as 110 additional compounds (external set II) published recently by the same laboratory (after this computational study was already in progress) were used as two independent validation sets to assess the external predictive power of individual models. In total, our virtual collaboratory has developed 15 different types of QSAR models of aquatic toxicity for the training set. The internal prediction accuracy for the modeling set ranged from 0.76 to 0.93 as measured by the leave-one-out cross-validation correlation coefficient ($Q^2_{abs}$). The prediction accuracy for the external validation sets I and II ranged from 0.71 to 0.85 (linear regression coefficient $R^2_{absI}$) and from 0.38 to 0.83 (linear regression coefficient $R^2_{absII}$), respectively. The use of an applicability domain threshold implemented in most models generally improved the external prediction accuracy but at the same time led to a decrease in chemical space coverage. Finally, several consensus models were developed by averaging the predicted aquatic toxicity for every compound using all 15 models, with or without taking into account their respective applicability domains. We find that consensus models afford higher prediction accuracy for the external validation data sets with the highest space coverage as compared to individual constituent models. Our studies prove the power of a collaborative and consensual approach to QSAR model development. The best validated models of aquatic toxicity developed by our collaboratory (both individual and consensus) can be used as reliable computational predictors of aquatic toxicity and are available from any of the participating laboratories.
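The consensus step described in this abstract, averaging the predictions of all models for each compound with or without regard to applicability domains, can be sketched as follows. The array layout, the boolean domain mask, and the function name are illustrative assumptions; how each group defined its applicability domain is not specified here.

```python
import numpy as np


def consensus_prediction(predictions, in_domain=None):
    """Average model predictions per compound and report chemical space coverage.

    predictions: (n_models, n_compounds) array of predicted toxicities
    in_domain:   optional boolean array of the same shape; True where a compound
                 falls inside a model's applicability domain (assumed input)
    Returns (consensus, coverage); consensus is NaN for uncovered compounds.
    """
    predictions = np.asarray(predictions, dtype=float)
    if in_domain is None:
        in_domain = np.ones(predictions.shape, dtype=bool)
    in_domain = np.asarray(in_domain, dtype=bool)

    votes = in_domain.sum(axis=0)
    covered = votes > 0
    totals = np.where(in_domain, predictions, 0.0).sum(axis=0)

    consensus = np.full(predictions.shape[1], np.nan)
    consensus[covered] = totals[covered] / votes[covered]
    return consensus, covered.mean()


model_predictions = np.array([[1.0, 2.0, 3.0],
                              [3.0, 4.0, 5.0]])
domain_mask = np.array([[True, True, False],
                        [True, False, False]])

plain, plain_coverage = consensus_prediction(model_predictions)
masked, masked_coverage = consensus_prediction(model_predictions, domain_mask)
```

In the toy example the masked consensus leaves the third compound unpredicted, which mirrors the trade-off noted above between applying a domain threshold and keeping chemical space coverage.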