Bill C. White
Dartmouth College
Publications
Featured research published by Bill C. White.
The Lancet | 2003
Kiyoshi Yanagisawa; Yu Shyr; Baogang J. Xu; Pierre P. Massion; Paul Larsen; Bill C. White; John Roberts; Mary E. Edgerton; Adriana Gonzalez; Sorena Nadaf; Jason H. Moore; Richard M. Caprioli; David P. Carbone
BACKGROUND Proteomics-based approaches complement the genome initiatives and may be the next step in attempts to understand the biology of cancer. We used matrix-assisted laser desorption/ionisation mass spectrometry directly from 1-mm regions of single frozen tissue sections for profiling of protein expression from surgically resected tissues to classify lung tumours. METHODS Proteomic spectra were obtained and aligned from 79 lung tumours and 14 normal lung tissues. We built a class-prediction model with the proteomic patterns in a training cohort of 42 lung tumours and eight normal lung samples, and assessed their statistical significance. We then applied this model to a blinded test cohort, including 37 lung tumours and six normal lung samples, to estimate the misclassification rate. FINDINGS We obtained more than 1600 protein peaks from histologically selected 1 mm diameter regions of single frozen sections from each tissue. Class-prediction models based on differentially expressed peaks enabled us to perfectly classify lung cancer histologies, distinguish primary tumours from metastases to the lung from other sites, and classify nodal involvement with 85% accuracy in the training cohort. This model nearly perfectly classified samples in the independent blinded test cohort. We also obtained a proteomic pattern comprised of 15 distinct mass spectrometry peaks that distinguished between patients with resected non-small-cell lung cancer who had poor prognosis (median survival 6 months, n=25) and those who had good prognosis (median survival 33 months, n=41, p<0.0001). INTERPRETATION Proteomic patterns obtained directly from small amounts of fresh frozen lung-tumour tissue could be used to accurately classify and predict histological groups as well as nodal involvement and survival in resected non-small-cell lung cancer.
BMC Bioinformatics | 2003
Marylyn D. Ritchie; Bill C. White; Joel Parker; Lance W. Hahn; Jason H. Moore
Background: Appropriate definition of neural network architecture prior to data analysis is crucial for successful data mining. This can be challenging when the underlying model of the data is unknown. The goal of this study was to determine whether optimizing neural network architecture using genetic programming as a machine learning strategy would improve the ability of neural networks to model and detect nonlinear interactions among genes in studies of common human diseases. Results: Using simulated data, we show that a genetic programming optimized neural network approach is able to model gene-gene interactions as well as a traditional back propagation neural network. Furthermore, the genetic programming optimized neural network is better than the traditional back propagation neural network approach in terms of predictive ability and power to detect gene-gene interactions when non-functional polymorphisms are present. Conclusion: This study suggests that a machine learning strategy for optimizing neural network architecture may be preferable to traditional trial-and-error approaches for the identification and characterization of gene-gene interactions in common, complex human diseases.
Human Heredity | 2004
Scott M. Williams; Marylyn D. Ritchie; John A. Phillips; Elliot Dawson; Melissa A. Prince; Elvira Dzhura; Alecia Willis; Amma Semenya; Marshall L. Summar; Bill C. White; Jonathan H. Addy; John Kpodonu; Lee-Jun Wong; Robin A. Felder; Pedro A. Jose; Jason H. Moore
While hypertension is a complex disease with a well-documented genetic component, genetic studies often fail to replicate findings. One possibility for such inconsistency is that the underlying genetics of hypertension is not based on single genes of major effect, but on interactions among genes. To test this hypothesis, we studied both single locus and multilocus effects, using a case-control design of subjects from Ghana. Thirteen polymorphisms in eight candidate genes were studied. Each candidate gene has been shown to play a physiological role in blood pressure regulation and affects one of four pathways that modulate blood pressure: vasoconstriction (angiotensinogen, angiotensin converting enzyme – ACE, angiotensin II receptor), nitric oxide (NO) dependent and NO independent vasodilation pathways and sodium balance (G protein-coupled receptor kinase, GRK4). We evaluated single site allelic and genotypic associations, multilocus genotype equilibrium and multilocus genotype associations, using multifactor dimensionality reduction (MDR). For MDR, we performed systematic reanalysis of the data to address the role of various physiological pathways. We found no significant single site associations, but the hypertensive class deviated significantly from genotype equilibrium in more than 25% of all multilocus comparisons (2,162 of 8,178), whereas the normotensive class rarely did (11 of 8,178). The MDR analysis identified a two-locus model including ACE and GRK4 that successfully predicted blood pressure phenotype 70.5% of the time. Thus, our data indicate epistatic interactions play a major role in hypertension susceptibility. Our data also support a model where multiple pathways need to be affected in order to predispose to hypertension.
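The core MDR step used in the analysis above can be sketched in a few lines: each multilocus genotype combination is pooled into "high-risk" or "low-risk" by comparing its case/control ratio with the overall ratio in the data, collapsing the multilocus space into a single constructed attribute. The toy two-locus data below is illustrative only, not the Ghanaian study data.

```python
import random
from collections import defaultdict

def mdr_construct(genotypes, labels):
    """Pool multilocus genotype combinations into one binary attribute.

    Each row of `genotypes` is a tuple of genotype codes for the loci
    under consideration. A combination is labeled high-risk when its
    case/control ratio exceeds the overall case/control ratio.
    """
    cases = defaultdict(int)
    controls = defaultdict(int)
    for g, y in zip(genotypes, labels):
        if y == 1:
            cases[g] += 1
        else:
            controls[g] += 1
    overall = sum(labels) / (len(labels) - sum(labels))
    high_risk = set()
    for g in set(cases) | set(controls):
        ratio = cases[g] / max(controls[g], 1)
        if ratio > overall:
            high_risk.add(g)
    return high_risk

# Toy data: two loci coded 0/1/2; the combination (1, 1) is enriched
# in cases, mimicking a simple two-locus effect.
random.seed(0)
genotypes, labels = [], []
for _ in range(200):
    a, b = random.randint(0, 2), random.randint(0, 2)
    p_case = 0.8 if (a, b) == (1, 1) else 0.3
    genotypes.append((a, b))
    labels.append(1 if random.random() < p_case else 0)

high_risk = mdr_construct(genotypes, labels)
```

A classifier such as naive Bayes can then be trained on the single constructed attribute (sample in high-risk set or not) rather than on the raw loci.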
Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics | 2007
Jason H. Moore; Bill C. White
Genetic Epidemiology | 2009
Kristine A. Pattin; Bill C. White; Nate Barney; Jiang Gui; Heather H. Nelson; Karl R. Kelsey; Angeline S. Andrew; Margaret R. Karagas; Jason H. Moore
Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free data mining method for detecting, characterizing, and interpreting epistasis in the absence of significant main effects in genetic and epidemiologic studies of complex traits such as disease susceptibility. The goal of MDR is to change the representation of the data using a constructive induction algorithm to make nonadditive interactions easier to detect using any classification method such as naive Bayes or logistic regression. Traditionally, MDR constructed variables have been evaluated with a naive Bayes classifier combined with 10-fold cross validation to obtain an estimate of the predictive accuracy or generalizability of epistasis models, and the statistical significance of models obtained through MDR has been assessed by permutation testing. The advantage of permutation testing is that it controls for false positives due to multiple testing. The disadvantage is that permutation testing is computationally expensive. This is an important issue that arises in the context of detecting epistasis on a genome-wide scale. The goal of the present study was to develop and evaluate several alternatives to large-scale permutation testing for assessing the statistical significance of MDR models. Using data simulated from 70 different epistasis models, we compared the power and type I error rate of MDR using a 1,000-fold permutation test with hypothesis testing using an extreme value distribution (EVD). We find that this new hypothesis testing method provides a reasonable alternative to the computationally expensive 1,000-fold permutation test and is 50 times faster. We then demonstrate this new method by applying it to a genetic epidemiology study of bladder cancer susceptibility that was previously analyzed using MDR and assessed using a 1,000-fold permutation test.
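The two significance-testing strategies compared above can be sketched as follows: permute the phenotype labels to build a null distribution of model accuracies, then either count how often the null beats the observed accuracy (the permutation p-value) or fit a Gumbel (type I extreme value) distribution to the null and read the p-value from its tail. This is a minimal method-of-moments sketch with made-up predictions, not the paper's EVD procedure.

```python
import math
import random

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def permutation_null(preds, labels, n_perm, rng):
    """Null accuracies: re-score fixed predictions against shuffled labels."""
    accs = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        accs.append(accuracy(preds, shuffled))
    return accs

def gumbel_pvalue(observed, null_accs):
    """Fit a Gumbel distribution to the null accuracies by the method
    of moments and return the upper-tail probability P(null >= observed)."""
    n = len(null_accs)
    mean = sum(null_accs) / n
    var = sum((a - mean) ** 2 for a in null_accs) / n
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - 0.5772156649 * beta  # Euler-Mascheroni constant
    z = (observed - mu) / beta
    return 1.0 - math.exp(-math.exp(-z))

rng = random.Random(1)
labels = [1] * 50 + [0] * 50
# A hypothetical model's predictions, correct about 90% of the time.
preds = [y if rng.random() < 0.9 else 1 - y for y in labels]
obs = accuracy(preds, labels)
null = permutation_null(preds, labels, 100, rng)
p = gumbel_pvalue(obs, null)
```

The appeal of the EVD route is that a distribution fit to a modest number of permutations replaces the thousands of permutations an empirical tail estimate would need.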
Parallel Problem Solving from Nature | 2006
Jason H. Moore; Bill C. White
Human genetics is undergoing an information explosion. The availability of chip-based technology facilitates the measurement of thousands of DNA sequence variations from across the human genome. The challenge is to sift through these high-dimensional datasets to identify combinations of interacting DNA sequence variations that are predictive of common diseases. The goal of this paper was to develop and evaluate a genetic programming (GP) approach for attribute selection and modeling that uses expert knowledge such as Tuned ReliefF (TuRF) scores during selection to ensure trees with good building blocks are recombined and reproduced. We show here that using expert knowledge to select trees performs as well as a multiobjective fitness function but requires only a tenth of the population size. This study demonstrates that GP may be a useful computational discovery tool in this domain.
Expert Review of Proteomics | 2004
David M. Reif; Bill C. White; Jason H. Moore
The rapid expansion of methods for measuring biological data ranging from DNA sequence variations to mRNA expression and protein abundance presents the opportunity to utilize multiple types of information jointly in the study of human health and disease. Organisms are complex systems that integrate inputs at myriad levels to arrive at an observable phenotype. Therefore, it is essential that questions concerning the etiology of phenotypes as complex as common human diseases take the systemic nature of biology into account, and integrate the information provided by each data type in a manner analogous to the operation of the body itself. While limited in scope, the initial forays into the joint analysis of multiple data types have yielded interesting results that would not have been reached had only one type of data been considered. These early successes, along with the aforementioned theoretical appeal of data integration, provide impetus for the development of methods for the parallel, high-throughput analysis of multiple data types. The idea that the integrated analysis of multiple data types will improve the identification of biomarkers of clinical endpoints, such as disease susceptibility, is presented as a working hypothesis.
Archive | 2007
Jason H. Moore; Bill C. White
Human genetics is undergoing an information explosion. The availability of chip-based technology facilitates the measurement of thousands of DNA sequence variations from across the human genome. The challenge is to sift through these high-dimensional datasets to identify combinations of interacting DNA sequence variations that are predictive of common diseases. The goal of this study is to develop and evaluate a genetic programming (GP) approach to attribute selection and classification in this domain. We simulated genetic datasets of varying size in which the disease model consists of two interacting DNA sequence variations that exhibit no independent effects on class (i.e. epistasis). We show that GP is no better than a simple random search when classification accuracy is used as the fitness function. We then show that including pre-processed estimates of attribute quality using Tuned ReliefF (TuRF) in a multi-objective fitness function that also includes accuracy significantly improves the performance of GP over that of random search. This study demonstrates that GP may be a useful computational discovery tool in this domain. This study raises important questions about the general utility of GP for these types of problems, the importance of data preprocessing, the ideal functional form of the fitness function, and the importance of expert knowledge. We anticipate this study will provide an important baseline for future studies investigating the usefulness of GP as a general computational discovery tool for large-scale genetic studies.
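The multi-objective fitness idea described above can be sketched in a few lines: a GP tree's score combines its classification accuracy with the mean TuRF quality of the attributes it uses, so selection favors trees built from attributes that look relevant a priori. The weighting `alpha` and the TuRF scores below are hypothetical, not values from the study.

```python
def multiobjective_fitness(tree_attrs, accuracy, turf_scores, alpha=0.5):
    """Score a GP tree by blending classification accuracy with the
    mean TuRF quality of the attributes it uses (both in [0, 1])."""
    quality = sum(turf_scores[a] for a in tree_attrs) / len(tree_attrs)
    return alpha * accuracy + (1 - alpha) * quality

# Hypothetical TuRF scores for five attributes; attributes 0 and 3
# look relevant, the rest look like noise.
turf = {0: 0.9, 1: 0.1, 2: 0.05, 3: 0.8, 4: 0.1}

# Two trees with identical accuracy: selection now prefers the one
# built from high-quality attributes, preserving its building blocks.
good = multiobjective_fitness({0, 3}, 0.62, turf)
bad = multiobjective_fitness({1, 2}, 0.62, turf)
```

Under a plain accuracy fitness these two trees would be indistinguishable, which is why the paper finds accuracy-only GP no better than random search on epistatic models with no main effects.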
Ant Colony Optimization and Swarm Intelligence | 2008
Casey S. Greene; Bill C. White; Jason H. Moore
In human genetics it is now feasible to measure large numbers of DNA sequence variations across the human genome. Given current knowledge about biological networks and disease processes it seems likely that disease risk can best be modeled by interactions between biological components, which can be examined as interacting DNA sequence variations. The machine learning challenge is to effectively explore interactions in these datasets to identify combinations of variations which are predictive of common human diseases. Ant colony optimization (ACO) is a promising approach to this problem. The goal of this study is to examine the usefulness of ACO for problems in this domain and to develop a prototype of an expert knowledge guided probabilistic search wrapper. We show that an ACO approach is not successful in the absence of expert knowledge but is successful when expert knowledge is supplied through the pheromone updating rule.
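The expert-knowledge-guided pheromone update described above can be sketched as follows: ants sample attribute subsets in proportion to pheromone, and deposits on selected attributes are weighted by an a priori score such as a TuRF estimate, so the search concentrates on attributes believed to be relevant. All parameters and scores below are illustrative, not the study's configuration.

```python
import random

def run_aco(n_attrs, subset_size, expert_scores, n_ants=20, n_iters=30,
            evaporation=0.1, seed=0):
    """Expert-knowledge-guided ACO attribute selection (sketch).

    Returns the final pheromone vector; high-pheromone attributes are
    the ones the colony converged on.
    """
    rng = random.Random(seed)
    pheromone = [1.0] * n_attrs
    for _ in range(n_iters):
        for _ in range(n_ants):
            # Each ant samples a subset proportionally to pheromone.
            total = sum(pheromone)
            subset = set()
            while len(subset) < subset_size:
                r, acc = rng.random() * total, 0.0
                for a, tau in enumerate(pheromone):
                    acc += tau
                    if acc >= r:
                        subset.add(a)
                        break
            # Expert knowledge enters through the pheromone update rule:
            # the deposit is the attribute's a priori (e.g. TuRF) score.
            for a in subset:
                pheromone[a] += expert_scores[a]
        pheromone = [(1 - evaporation) * tau for tau in pheromone]
    return pheromone

# Toy run: 10 attributes, expert knowledge favors attributes 2 and 7.
scores = [0.01] * 10
scores[2] = scores[7] = 0.5
tau = run_aco(10, 2, scores)
best = sorted(range(10), key=lambda a: tau[a], reverse=True)[:2]
```

With uniform deposits instead of expert-weighted ones, the positive feedback has no signal to amplify, which mirrors the paper's finding that ACO fails without expert knowledge.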
Human Heredity | 2007
Jason H. Moore; Nate Barney; Chia-Ti Tsai; Fu-Tien Chiang; Jiang Gui; Bill C. White
The workhorse of modern genetic analysis is the parametric linear model. The advantages of the linear modeling framework are many and include a mathematical understanding of the model fitting process and ease of interpretation. However, an important limitation is that linear models make assumptions about the nature of the data being modeled. These assumptions may not be realistic for complex biological systems such as disease susceptibility, where nonlinearities in the genotype to phenotype mapping relationship that result from epistasis, plastic reaction norms, locus heterogeneity, and phenocopy, for example, are the norm rather than the exception. We have previously developed a flexible modeling approach called symbolic discriminant analysis (SDA) that makes no assumptions about the patterns in the data. Rather, SDA lets the data dictate the size, shape, and complexity of a symbolic discriminant function that could include any set of mathematical functions from a list of candidates supplied by the user. Here, we outline a new five step process for symbolic model discovery that uses genetic programming (GP) for coarse-grained stochastic searching, experimental design for parameter optimization, graphical modeling for generating expert knowledge, and estimation of distribution algorithms for fine-grained stochastic searching. Finally, we introduce function mapping as a new method for interpreting symbolic discriminant functions. We show that function mapping when combined with measures of interaction information facilitates statistical interpretation by providing a graphical approach to decomposing complex models to highlight synergistic, redundant, and independent effects of polymorphisms and their composite functions. We illustrate this five step SDA modeling process with a real case-control dataset.
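A minimal sketch of what a symbolic discriminant function looks like: an expression tree over genotype attributes and constants, with a sample classified as a case when the discriminant is positive. The GP search, parameter optimization, and function mapping steps are omitted, and the hand-written expression below (which separates an XOR-style epistatic pattern) is illustrative only; in SDA it would be discovered, not supplied.

```python
# Function set supplied by the user; SDA can draw on any such list.
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def evaluate(tree, x):
    """Recursively evaluate an expression tree on attribute vector x."""
    if isinstance(tree, str):           # terminal: attribute like "x0"
        return x[int(tree[1:])]
    if isinstance(tree, (int, float)):  # terminal: constant
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def accuracy(tree, data):
    """Fraction of samples where the sign of the discriminant
    matches the class label (case iff discriminant > 0)."""
    return sum((evaluate(tree, x) > 0) == y for x, y in data) / len(data)

# Purely epistatic toy pattern: case iff exactly one of two binary
# loci carries the variant (no main effects).
data = [((a, b), a != b) for a in (0, 1) for b in (0, 1)]

# Discriminant x0 + x1 - 2*x0*x1 - 0.5, written as an expression tree.
tree = ("-", ("-", ("+", "x0", "x1"), ("*", 2.0, ("*", "x0", "x1"))), 0.5)
```

No linear function of x0 and x1 alone separates this pattern, which is the kind of nonlinearity the abstract argues parametric linear models handle poorly.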