Constantin F. Aliferis
New York University
Publication
Featured research published by Constantin F. Aliferis.
Machine Learning | 2006
Ioannis Tsamardinos; Laura E. Brown; Constantin F. Aliferis
We present a new algorithm for Bayesian network structure learning, called Max-Min Hill-Climbing (MMHC). The algorithm combines ideas from local learning, constraint-based, and search-and-score techniques in a principled and effective way. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. In our extensive empirical evaluation, MMHC outperforms, on average and across a range of metrics, several prototypical and state-of-the-art algorithms: PC, Sparse Candidate, Three Phase Dependency Analysis, Optimal Reinsertion, Greedy Equivalence Search, and Greedy Search. These are the first empirical results simultaneously comparing most of the major Bayesian network algorithms against each other. MMHC offers certain theoretical advantages, specifically over the Sparse Candidate algorithm, which are corroborated by our experiments. MMHC and detailed results of our study are publicly available at http://www.dsl-lab.org/supplements/mmhc_paper/mmhc_index.html.
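The abstract describes a two-phase, constrain-then-score procedure: learn a skeleton from independence tests, then orient edges by score-based hill climbing restricted to that skeleton. The sketch below is a minimal illustration of that idea using the pgmpy library, not the authors' implementation; the MmhcEstimator class, its constructor, and its estimate() method are assumptions about pgmpy's API that may differ across versions.

```python
# Minimal constrain-then-score sketch on toy discrete data (not the authors' code).
# Assumes pgmpy exposes an MMHC-style estimator as pgmpy.estimators.MmhcEstimator.
import numpy as np
import pandas as pd
from pgmpy.estimators import MmhcEstimator

# Toy discrete data generated from the chain A -> B -> C.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 2000)
b = np.where(rng.random(2000) < 0.9, a, 1 - a)   # B mostly copies A
c = np.where(rng.random(2000) < 0.9, b, 1 - b)   # C mostly copies B
df = pd.DataFrame({"A": a, "B": b, "C": c})

# Phase 1 recovers the skeleton A-B and B-C from conditional-independence tests;
# phase 2 orients edges with a greedy, score-based hill climb restricted to it.
dag = MmhcEstimator(df).estimate()
print(sorted(dag.edges()))
```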
Bioinformatics | 2005
Alexander R. Statnikov; Constantin F. Aliferis; Ioannis Tsamardinos; Douglas P. Hardin; Shawn Levy
MOTIVATION Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories (41 cancer types and 12 normal tissue types). RESULTS Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques of Crammer and Singer, of Weston and Watkins, and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system, GEMS (Gene Expression Model Selector), that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. AVAILABILITY The software system GEMS is available for download from http://www.gems-system.org for non-commercial use. CONTACT [email protected].
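A hedged sketch of the kind of pipeline the study evaluates: univariate gene selection feeding a one-versus-rest linear SVM, with nested cross-validation for model selection and performance estimation. It uses scikit-learn on synthetic data; the feature counts, parameter grid, and dataset are illustrative and are not those of GEMS.

```python
# Nested-CV sketch of a multicategory SVM with gene selection (illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a microarray matrix: many genes, few samples, 4 classes.
X, y = make_classification(n_samples=120, n_features=2000, n_informative=40,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),                     # gene selection
    ("svm", SVC(kernel="linear", decision_function_shape="ovr")),
])
grid = GridSearchCV(pipe,
                    {"select__k": [50, 200, 1000], "svm__C": [0.1, 1, 10]},
                    cv=3)                                    # inner loop: model selection

# Outer loop estimates the accuracy of the whole model-selection procedure.
scores = cross_val_score(grid, X, y, cv=5)
print(scores.mean())
```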
Knowledge Discovery and Data Mining | 2003
Ioannis Tsamardinos; Constantin F. Aliferis; Alexander R. Statnikov
Data mining with Bayesian network learning has two important characteristics: first, under certain conditions the learned edges between variables correspond to causal influences, and second, for every variable T in the network a special subset (the Markov Blanket), identifiable from the network, is the minimal variable set required to predict T. However, all known algorithms for learning a complete BN do not scale beyond a few hundred variables. On the other hand, all known sound algorithms for learning a local region of the network require a number of training instances exponential in the size of the learned region. The contribution of this paper is two-fold. We introduce a novel local algorithm that returns all variables with direct edges to and from a target variable T, as well as a local algorithm that returns the Markov Blanket of T. Both algorithms (i) are sound, (ii) can be run efficiently on datasets with thousands of variables, and (iii) significantly outperform previous state-of-the-art algorithms at approximating the true neighborhood, using only a fraction of the training sample required by existing methods. A fundamental difference between our approach and existing ones is that the required sample depends on the connectivity of the generating graph and not on the size of the local region; this yields up to exponential savings in sample size relative to previously known algorithms. The results presented here are promising not only for discovery of local causal structure and variable selection for classification, but also for the induction of complete BNs.
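The grow/shrink-style sketch below illustrates what "local neighborhood discovery around a target T" means on toy Gaussian data; it is not the paper's (sound) algorithm, and the partial-correlation threshold, variable names, and data-generating process are assumptions made only for the example.

```python
# Illustrative local discovery around a target T: grow a candidate set with
# marginal association, then shrink it by removing any variable that becomes
# independent of T conditioned on the remaining candidates.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
t = x1 + x2 + 0.5 * rng.normal(size=n)          # T's true parents: X1, X2
x3 = t + 0.5 * rng.normal(size=n)               # child of T
x4 = x1 + 0.5 * rng.normal(size=n)              # associated with T only through X1
data = {"X1": x1, "X2": x2, "X3": x3, "X4": x4}

def partial_corr(a, b, conditioning):
    """Correlation of a and b after regressing out the conditioning set."""
    Z = np.column_stack(conditioning) if conditioning else np.zeros((len(a), 0))
    Z = np.column_stack([Z, np.ones(len(a))])
    ra = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    rb = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

# Grow: keep variables marginally associated with T.
candidates = [v for v in data if abs(partial_corr(data[v], t, [])) > 0.05]
# Shrink: X4 is dropped because it is independent of T given X1.
neighbours = [v for v in candidates
              if abs(partial_corr(data[v], t,
                                  [data[u] for u in candidates if u != v])) > 0.05]
print(neighbours)   # expected: ['X1', 'X2', 'X3']
```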
International Journal of Medical Informatics | 2005
Alexander R. Statnikov; Ioannis Tsamardinos; Yerbolat Dosbayev; Constantin F. Aliferis
The success of treatment of patients with cancer depends on establishing an accurate diagnosis. To this end, we have built a system called GEMS (Gene Expression Model Selector) for the automated development and evaluation of high-quality cancer diagnostic models and for biomarker discovery from microarray gene expression data. In order to determine and equip the system with the best-performing diagnostic methodologies in this domain, we first conducted a comprehensive evaluation of classification algorithms using 11 cancer microarray datasets. In this paper we present a preliminary evaluation of the system with five new datasets. The performance of the models produced automatically by GEMS is comparable to, or better than, the results obtained by human analysts. Additionally, we performed a cross-dataset evaluation of the system. This involved using one dataset to build a diagnostic model and to estimate its future performance, then applying this model and evaluating its performance on a different dataset. We found that models produced by GEMS indeed perform well in independent samples and, furthermore, that the cross-validation performance estimates output by the system closely approximate the error obtained on the independent validation dataset. GEMS is freely available for download for non-commercial use from http://www.gems-system.org.
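A minimal sketch of the cross-dataset check described above: estimate accuracy by cross-validation on one dataset, then measure the same model's accuracy on an independent dataset and compare the two numbers. The two "studies" here are halves of one synthetic dataset standing in for independent microarray studies of the same diagnostic task; none of the numbers relate to the paper's results.

```python
# Compare an internal cross-validation estimate with independent-dataset accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=500, n_informative=30,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
# Split into "study A" (model building) and "study B" (independent validation).
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)

model = SVC(kernel="linear")
cv_estimate = cross_val_score(model, X_a, y_a, cv=5).mean()   # internal CV estimate
external = model.fit(X_a, y_a).score(X_b, y_b)                # independent dataset
print(f"cross-validation estimate: {cv_estimate:.2f}, independent accuracy: {external:.2f}")
```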
Journal of the American Medical Informatics Association | 2004
Yindalon Aphinyanaphongs; Ioannis Tsamardinos; Alexander R. Statnikov; Douglas P. Hardin; Constantin F. Aliferis
OBJECTIVE Finding the best scientific evidence that applies to a patient problem is becoming increasingly difficult due to the exponential growth of medical publications. The objective of this study was to apply machine learning techniques to automatically identify high-quality, content-specific articles for one time period in internal medicine and compare their performance with the previous Boolean-based PubMed clinical query filters of Haynes et al. DESIGN The selection criteria of the ACP Journal Club for articles in internal medicine were the basis for identifying high-quality articles in the areas of etiology, prognosis, diagnosis, and treatment. Naive Bayes, a specialized AdaBoost algorithm, and linear and polynomial support vector machines were applied to identify these articles. MEASUREMENTS The machine learning models were compared in each category with each other and with the clinical query filters using area under the receiver operating characteristic curves, 11-point average recall precision, and a sensitivity/specificity match method. RESULTS In most categories, the data-induced models have better or comparable sensitivity, specificity, and precision than the clinical query filters. The polynomial support vector machine models perform the best among all learning methods in ranking the articles as evaluated by area under the receiver operating characteristic curve and 11-point average recall precision. CONCLUSION This research shows that, using machine learning methods and inclusion or citation by the ACP Journal Club as a gold standard, it is possible to automatically build models for retrieving high-quality, content-specific articles in internal medicine for a given time period that perform better than the 1994 PubMed clinical query filters.
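The two ranking metrics named in the abstract can be computed as sketched below on a toy ranked list of articles; the labels and scores are invented, and only the metric computations (ROC AUC and 11-point interpolated average precision) are of interest.

```python
# Toy computation of ROC AUC and 11-point interpolated average precision.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])                # 1 = high-quality article
y_score = np.array([.9, .85, .8, .7, .65, .6, .55, .4, .3, .2])  # model ranking scores

print("ROC AUC:", roc_auc_score(y_true, y_score))

# 11-point average precision: interpolated precision at recall 0.0, 0.1, ..., 1.0.
precision, recall, _ = precision_recall_curve(y_true, y_score)
precision, recall = precision[:-1], recall[:-1]   # drop the artificial (1.0, 0.0) endpoint
points = [precision[recall >= r].max() for r in np.linspace(0, 1, 11)]
print("11-point average precision:", np.mean(points))
```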
Artificial Intelligence in Medicine | 1997
Gregory F. Cooper; Constantin F. Aliferis; Richard Ambrosino; John M. Aronis; Bruce G. Buchanan; Rich Caruana; Michael J. Fine; Clark Glymour; Geoffrey J. Gordon; Barbara H. Hanusa; Janine E. Janosky; Christopher Meek; Tom M. Mitchell; Thomas S. Richardson; Peter Spirtes
This paper describes the application of eight statistical and machine-learning methods to derive computer models for predicting mortality of hospital patients with pneumonia from their findings at initial presentation. Each of the eight models was constructed from 9847 patient cases and evaluated on 4352 additional cases. The primary evaluation metric was the error in predicted survival as a function of the fraction of patients predicted to survive. This metric is useful in assessing a model's potential to assist a clinician in deciding whether to treat a given patient in the hospital or at home. We examined the error rates of the models when predicting that a given fraction of patients will survive. We examined survival fractions between 0.1 and 0.6. Over this range, each model's predictive error rate was within 1% of the error rate of every other model. When predicting that approximately 30% of the patients will survive, all the models have an error rate of less than 1.5%. The models are distinguished more by the number of variables and parameters that they contain than by their error rates; these differences suggest which models may be the most amenable to future implementation as paper-based guidelines.
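A small sketch of the evaluation metric described above, run on synthetic predictions: rank patients by predicted survival probability, treat the top fraction as "predicted to survive", and report the error rate (observed deaths) within that group across survival fractions 0.1 to 0.6. The simulated probabilities and outcomes are not the study's data.

```python
# Error in predicted survival as a function of the fraction predicted to survive.
import numpy as np

rng = np.random.default_rng(0)
p_survive = rng.random(1000)                   # model's predicted survival probabilities
survived = rng.random(1000) < p_survive        # simulated outcomes consistent with them

order = np.argsort(-p_survive)                 # most confident survivors first
for frac in np.arange(0.1, 0.7, 0.1):
    top = order[: int(frac * len(order))]      # patients "predicted to survive"
    error = 1.0 - survived[top].mean()         # fraction of predicted survivors who died
    print(f"fraction predicted to survive: {frac:.1f}  error rate: {error:.3f}")
```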
PLOS ONE | 2012
Jonathan E. Feig; Yuliya Vengrenyuk; Vladimír Reiser; Chaowei Wu; Alexander Statnikov; Constantin F. Aliferis; Michael J. Garabedian; Edward A. Fisher; Oscar Puig
We have developed a mouse model of atherosclerotic plaque regression in which an atherosclerotic aortic arch from a hyperlipidemic donor is transplanted into a normolipidemic recipient, resulting in rapid elimination of cholesterol and monocyte-derived macrophage cells (CD68+) from transplanted vessel walls. To gain a comprehensive view of the differences in gene expression patterns in macrophages associated with regressing compared with progressing atherosclerotic plaque, we compared mRNA expression patterns in CD68+ macrophages extracted from plaque in aortic arches transplanted into normolipidemic or hyperlipidemic recipients. In CD68+ cells from regressing plaque we observed that genes associated with the contractile apparatus responsible for cellular movement (e.g. actin and myosin) were up-regulated, whereas genes related to cell adhesion (e.g. cadherins, vinculin) were down-regulated. In addition, CD68+ cells from regressing plaque were characterized by enhanced expression of genes associated with an anti-inflammatory M2 macrophage phenotype, including arginase I, CD163 and the C-lectin receptor. Our analysis suggests that in regressing plaque CD68+ cells preferentially express genes that reduce cellular adhesion, enhance cellular motility, and overall act to suppress inflammation.
Journal of the American Medical Informatics Association | 2006
Elmer V. Bernstam; Jorge R. Herskovic; Yindalon Aphinyanaphongs; Constantin F. Aliferis; Madurai G. Sriram; William R. Hersh
OBJECTIVE To determine whether algorithms developed for the World Wide Web can be applied to the biomedical literature in order to identify articles that are important as well as relevant. DESIGN AND MEASUREMENTS A direct comparison of eight algorithms: simple PubMed queries, clinical queries (sensitive and specific versions), vector cosine comparison, citation count, journal impact factor, PageRank, and machine learning based on polynomial support vector machines. The objective was to prioritize important articles, defined as being included in a pre-existing bibliography of important literature in surgical oncology. RESULTS Citation-based algorithms were more effective than noncitation-based algorithms at identifying important articles. The most effective strategies were simple citation count and PageRank, which on average identified over six important articles in the first 100 results compared to 0.85 for the best noncitation-based algorithm (p < 0.001). The authors saw similar differences between citation-based and noncitation-based algorithms at 10, 20, 50, 200, 500, and 1,000 results (p < 0.001). Citation lag affects performance of PageRank more than simple citation count. However, in spite of citation lag, citation-based algorithms remain more effective than noncitation-based algorithms. CONCLUSION Algorithms that have proved successful on the World Wide Web can be applied to biomedical information retrieval. Citation-based algorithms can help identify important articles within large sets of relevant results. Further studies are needed to determine whether citation-based algorithms can effectively meet actual user information needs.
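The two citation-based strategies that performed best can be illustrated on a toy citation graph, as in the sketch below: raw citation counts and PageRank computed by power iteration with a damping factor of 0.85. The five "articles" and their citation links are invented for the example.

```python
# Citation count and PageRank (power iteration) on a toy citation graph.
import numpy as np

articles = ["A", "B", "C", "D", "E"]
cites = {"A": ["B", "C"], "B": ["C"], "D": ["C", "B"], "E": ["A", "C"], "C": []}

# Citation count: how often each article is cited by the others.
counts = {a: sum(a in refs for refs in cites.values()) for a in articles}

# PageRank by power iteration with damping factor 0.85.
n, d = len(articles), 0.85
idx = {a: i for i, a in enumerate(articles)}
M = np.zeros((n, n))
for src, refs in cites.items():
    targets = refs if refs else articles        # dangling node: spread evenly
    for dst in targets:
        M[idx[dst], idx[src]] = 1.0 / len(targets)
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank

print("citation counts:", counts)
print("PageRank:", dict(zip(articles, rank.round(3))))
```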
mBio | 2013
Alexander Statnikov; Mikael Henaff; Varun Narendra; Kranti Konganti; Zhiguo Li; Liying Yang; Zhiheng Pei; Martin J. Blaser; Constantin F. Aliferis; Alexander V. Alekseyenko
BACKGROUND Recent advances in next-generation DNA sequencing enable rapid high-throughput quantitation of microbial community composition in human samples, opening up a new field of microbiomics. One of the promises of this field is linking abundances of microbial taxa to phenotypic and physiological states, which can inform development of new diagnostic, personalized medicine, and forensic modalities. Prior research has demonstrated the feasibility of applying machine learning methods to perform body site and subject classification with microbiomic data. However, it is currently unknown which classifiers perform best among the many available alternatives for classification with microbiomic data. RESULTS In this work, we performed a systematic comparison of 18 major classification methods, 5 feature selection methods, and 2 accuracy metrics using 8 datasets spanning 1,802 human samples and various classification tasks: body site and subject classification and diagnosis. CONCLUSIONS We found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data.
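A hedged sketch of the kind of head-to-head comparison the study performs: several standard classifiers scored by cross-validated accuracy on the same abundance matrix. The Dirichlet-sampled matrix merely imitates a table of relative taxon abundances and is not one of the paper's datasets; the classifier settings are arbitrary.

```python
# Compare several classifiers by cross-validated accuracy on synthetic abundance data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(200), size=300)       # 300 samples x 200 taxa, rows sum to 1
y = (X[:, :10].sum(axis=1) > np.median(X[:, :10].sum(axis=1))).astype(int)

classifiers = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "linear SVM": SVC(kernel="linear"),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```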
International Conference on Machine Learning | 2004
Douglas P Hardin; Ioannis Tsamardinos; Constantin F. Aliferis
Most prevalent techniques for feature selection with Support Vector Machines (SVMs) are based on the intuition that features whose weights are close to zero are not required for optimal classification. In this paper we show that, in the sample limit, the irrelevant variables (in a theoretical and optimal sense) will indeed be given zero weight by a linear SVM, in both the soft- and hard-margin cases. However, SVM-based methods have certain theoretical disadvantages too. We present examples where the linear SVM may assign zero weight to strongly relevant variables (i.e., variables required for optimal estimation of the distribution of the target variable) and where weakly relevant features (i.e., features that are superfluous for optimal prediction given other features) may receive non-zero weight. We contrast this behavior with, and theoretically compare it to, Markov Blanket-based feature selection algorithms, which do not have such disadvantages in a broad class of distributions and can also be used for causal discovery.
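A short sketch of the weight-based selection heuristic the paper analyzes: fit a linear SVM and rank features by the absolute value of their weights. As the abstract cautions, a near-zero weight does not guarantee a feature is irrelevant, nor does a non-zero weight guarantee it is needed; the synthetic data and the cutoff of ten features below are arbitrary choices for illustration.

```python
# Weight-based feature ranking with a linear SVM (binary task, synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)
svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

weights = np.abs(svm.coef_).ravel()              # one weight per feature for a binary task
top = np.argsort(weights)[::-1][:10]             # indices of the 10 largest |weights|
print("selected feature indices:", top)
```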