Publication


Featured research published by Daniela M. Witten.


Technometrics | 2011

Sparse Discriminant Analysis

Line Katrine Harder Clemmensen; Trevor Hastie; Daniela M. Witten; Bjarne Kjær Ersbøll

We consider the problem of performing interpretable classification in the high-dimensional setting, in which the number of features is very large and the number of observations is limited. This setting has been studied extensively in the chemometrics literature, and more recently has become commonplace in biological and medical applications. In this setting, a traditional approach involves performing feature selection before classification. We propose sparse discriminant analysis, a method for performing linear discriminant analysis with a sparseness criterion imposed such that classification and feature selection are performed simultaneously. Sparse discriminant analysis is based on the optimal scoring interpretation of linear discriminant analysis, and can be extended to perform sparse discrimination via mixtures of Gaussians if boundaries between classes are nonlinear or if subgroups are present within each class. Our proposal also provides low-dimensional views of the discriminative directions.


Nature Biotechnology | 2012

Massively parallel functional dissection of mammalian enhancers in vivo

Rupali P Patwardhan; Joseph Hiatt; Daniela M. Witten; Mee J. Kim; Robin P. Smith; Dalit May; Choli Lee; Jennifer M. Andrie; Su-In Lee; Gregory M. Cooper; Nadav Ahituv; Len A. Pennacchio; Jay Shendure

The functional consequences of genetic variation in mammalian regulatory elements are poorly understood. We report the in vivo dissection of three mammalian enhancers at single-nucleotide resolution through a massively parallel reporter assay. For each enhancer, we synthesized a library of >100,000 mutant haplotypes with 2–3% divergence from the wild-type sequence. Each haplotype was linked to a unique sequence tag embedded within a transcriptional cassette. We introduced each enhancer library into mouse liver and measured the relative activities of individual haplotypes en masse by sequencing the transcribed tags. Linear regression analysis yielded highly reproducible estimates of the effect of every possible single-nucleotide change on enhancer activity. The functional consequence of most mutations was modest, with ∼22% affecting activity by >1.2-fold and ∼3% by >2-fold. Several, but not all, positions with higher effects showed evidence for purifying selection, or co-localized with known liver-associated transcription factor binding sites, demonstrating the value of empirical high-resolution functional analysis.


Journal of the American Statistical Association | 2010

A Framework for Feature Selection in Clustering

Daniela M. Witten; Robert Tibshirani

We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated and genomic data.
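The abstract does not spell out how the lasso-type penalty selects features. A minimal sketch of one weight-update step follows, assuming the usual formulation: nonnegative feature weights with L2 norm at most 1 and L1 norm at most `s` (with `s >= 1`), chosen to maximize a weighted sum of per-feature between-cluster sums of squares. The names `update_weights`, `bcss`, and `s` are illustrative and not taken from the paper.

```python
import math


def soft_threshold(a, delta):
    """Shrink each nonnegative entry toward zero by delta."""
    return [max(x - delta, 0.0) for x in a]


def l2_normalize(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def update_weights(bcss, s, iters=60):
    """Feature weights for fixed clusters (hypothetical helper).

    bcss: per-feature between-cluster sum of squares.
    s:    L1 bound on the weights; smaller s gives sparser weights.
    """
    a = [max(x, 0.0) for x in bcss]
    w = l2_normalize(a)
    if sum(w) <= s:
        return w  # the L1 constraint is inactive, no thresholding needed
    # Bisect on the threshold until the L1 norm of the weights drops to s.
    lo, hi = 0.0, max(a)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(l2_normalize(soft_threshold(a, mid))) > s:
            lo = mid
        else:
            hi = mid
    return l2_normalize(soft_threshold(a, hi))


# Two informative features and two near-noise features.
w = update_weights([10.0, 8.0, 0.5, 0.2], s=1.2)
```

In a full procedure this step would alternate with re-clustering the observations on the weighted features, so that a single criterion drives both the weights and the clusters.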


Cell Stem Cell | 2009

Hierarchical Maintenance of MLL Myeloid Leukemia Stem Cells Employs a Transcriptional Program Shared with Embryonic Rather Than Adult Stem Cells

Tim C.P. Somervaille; Christina Matheny; Gary J. Spencer; Masayuki Iwasaki; John L. Rinn; Daniela M. Witten; Howard Y. Chang; Sheila A. Shurtleff; James R. Downing; Michael L. Cleary

The genetic programs that promote retention of self-renewing leukemia stem cells (LSCs) at the apex of cellular hierarchies in acute myeloid leukemia (AML) are not known. In a mouse model of human AML, LSCs exhibit variable frequencies that correlate with the initiating MLL oncogene and are maintained in a self-renewing state by a transcriptional subprogram more akin to that of embryonic stem cells (ESCs) than to that of adult stem cells. The transcription/chromatin regulatory factors Myb, Hmgb3, and Cbx5 are critical components of the program and suffice for Hoxa/Meis-independent immortalization of myeloid progenitors when coexpressed, establishing the cooperative and essential role of an ESC-like LSC maintenance program ancillary to the leukemia-initiating MLL/Hox/Meis program. Enriched expression of LSC maintenance and ESC-like program genes in normal myeloid progenitors and poor-prognosis human malignancies links the frequency of aberrantly self-renewing progenitor-like cancer stem cells (CSCs) to prognosis in human cancer.


Statistical Applications in Genetics and Molecular Biology | 2009

Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data

Daniela M. Witten; Robert Tibshirani

In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data, when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension to sparse CCA, which we call sparse supervised CCA, which results in the identification of linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data on more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA in order to extend the sparse CCA methodology to the case of more than two data sets. We demonstrate these new methods on simulated data and on a recently published and publicly available diffuse large B-cell lymphoma data set.


Journal of Investigative Dermatology | 2010

MicroRNA Expression Profiles Associated with Mutational Status and Survival in Malignant Melanoma

Stefano Caramuta; Suzanne Egyhazi; Monica Rodolfo; Daniela M. Witten; Johan Hansson; Catharina Larsson; Weng-Onn Lui

Malignant cutaneous melanoma is a highly aggressive form of skin cancer. Despite improvements in early melanoma diagnosis, the 5-year survival rate remains low in advanced disease. Therefore, novel biomarkers are urgently needed to devise new means of detection and treatment. In this study, we aimed to improve our understanding of microRNA (miRNA) deregulation in melanoma development and their impact on patient survival. Global miRNA expression profiles of a set of melanoma lymph node metastases, melanoma cell lines, and melanocyte cultures were determined using Agilent array. Deregulated miRNAs were evaluated in relation with clinical characteristics, patient survival, and mutational status for BRAF and NRAS. Several miRNAs were differentially expressed between melanocytes and melanomas as well as melanoma cell lines. In melanomas, miR-193a, miR-338, and miR-565 were underexpressed in cases with a BRAF mutation. Furthermore, low expression of miR-191 and high expression of miR-193b were associated with poor melanoma-specific survival. In conclusion, our findings show miRNA dysregulation in malignant melanoma and its relation to established molecular backgrounds of BRAF and NRAS oncogenic mutations. The identification of an miRNA classifier for poor survival may lead to the development of miRNA detection as a complementary prognostic tool in clinical practice.


Biostatistics | 2012

Normalization, testing, and false discovery rate estimation for RNA-sequencing data

Jun Li; Daniela M. Witten; Iain M. Johnstone; Robert Tibshirani

We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
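To make the log-linear idea concrete, here is a minimal sketch of a Poisson null model in which the expected count is the product of a sample depth term and a gene term. It uses plain total-count normalization for the depth, which is only a simple baseline and not the new normalization approach the abstract refers to. The function names and the per-gene goodness-of-fit statistic are illustrative assumptions.

```python
def null_fit(counts):
    """Fit log(mu[i][j]) = log(d[i]) + log(beta[j]) to a count table.

    counts[i][j] is the number of reads for gene j in sample i.
    Under a Poisson model the fitted value is
    (sample total) * (gene total) / (grand total).
    """
    n_samples, n_genes = len(counts), len(counts[0])
    sample_tot = [sum(row) for row in counts]
    grand_tot = sum(sample_tot)
    gene_tot = [sum(counts[i][j] for i in range(n_samples))
                for j in range(n_genes)]
    depth = [t / grand_tot for t in sample_tot]  # total-count normalization
    fitted = [[depth[i] * gene_tot[j] for j in range(n_genes)]
              for i in range(n_samples)]
    return depth, fitted


def gene_gof(counts, fitted):
    """Pearson-type lack of fit per gene; large values suggest the gene
    departs from the null of no association with the samples."""
    n_samples, n_genes = len(counts), len(counts[0])
    return [sum((counts[i][j] - fitted[i][j]) ** 2 / fitted[i][j]
                for i in range(n_samples) if fitted[i][j] > 0)
            for j in range(n_genes)]


# Second sample sequenced twice as deeply, with no real expression change.
counts = [[10, 30], [20, 60]]
depth, fitted = null_fit(counts)
gof = gene_gof(counts, fitted)
```

Because the second sample is an exact multiple of the first, the null model reproduces the table and every gene's lack-of-fit value is zero.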


Journal of Computational and Graphical Statistics | 2011

New Insights and Faster Computations for the Graphical Lasso

Daniela M. Witten; Jerome H. Friedman; Noah Simon

We consider the graphical lasso formulation for estimating a Gaussian graphical model in the high-dimensional setting. This approach entails estimating the inverse covariance matrix under a multivariate normal model by maximizing the ℓ1-penalized log-likelihood. We present a very simple necessary and sufficient condition that can be used to identify the connected components in the graphical lasso solution. The condition can be employed to determine whether the estimated inverse covariance matrix will be block diagonal, and if so, then to identify the blocks. This in turn can lead to drastic speed improvements, since one can simply apply a standard graphical lasso algorithm to each block separately. Moreover, the necessary and sufficient condition provides insight into the graphical lasso solution: the set of connected nodes at any given tuning parameter value is a superset of the set of connected nodes at any larger tuning parameter value. This article has supplementary material online.
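The abstract does not state the condition itself. As the result is commonly described, it amounts to thresholding the absolute off-diagonal entries of the sample covariance matrix at the tuning parameter and reading off the connected components of the resulting graph. The sketch below works under that assumption; `screened_blocks`, `S`, and `lam` are illustrative names.

```python
def screened_blocks(S, lam):
    """Partition variable indices into blocks.

    Variables i and j are linked when |S[i][j]| > lam; each connected
    component of that graph can be passed to a graphical lasso solver
    on its own.
    """
    p = len(S)
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(S[i][j]) > lam:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    blocks = {}
    for i in range(p):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())


# Toy sample covariance with two strongly correlated pairs.
S = [
    [1.0, 0.6, 0.05, 0.0],
    [0.6, 1.0, 0.1, 0.02],
    [0.05, 0.1, 1.0, 0.4],
    [0.0, 0.02, 0.4, 1.0],
]
blocks_small = screened_blocks(S, 0.05)
blocks_mid = screened_blocks(S, 0.2)
blocks_large = screened_blocks(S, 0.5)
```

Raising `lam` can only split blocks, never merge them, which mirrors the nesting of connected nodes across tuning parameter values described in the abstract.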


Genome Biology | 2012

Transcriptional profiling of long non-coding RNAs and novel transcribed regions across a diverse panel of archived human cancers

Alayne L Brunner; Andrew H. Beck; Badreddin Edris; Robert T. Sweeney; Shirley Zhu; Rui Li; Kelli Montgomery; Sushama Varma; Thea Gilks; Xiangqian Guo; Joseph W. Foley; Daniela M. Witten; Craig P. Giacomini; Ryan A. Flynn; Jonathan R. Pollack; Robert Tibshirani; Howard Y. Chang; Matt van de Rijn; Robert B. West

Background: Molecular characterization of tumors has been critical for identifying important genes in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. However, despite the potential importance of long non-coding RNAs to the cancer field, no comprehensive survey of long non-coding RNA expression across various cancers has been reported. Results: We performed a sequencing-based transcriptional survey of both known long non-coding RNAs and novel intergenic transcripts across a panel of 64 archival tumor samples comprising 17 diagnostic subtypes of adenocarcinomas, squamous cell carcinomas and sarcomas. We identified hundreds of transcripts from among the known 1,065 long non-coding RNAs surveyed that showed variability in transcript levels between the tumor types and are therefore potential biomarker candidates. We discovered 1,071 novel intergenic transcribed regions and demonstrate that these show similar patterns of variability between tumor types. We found that many of these differentially expressed cancer transcripts are also expressed in normal tissues. One such novel transcript specifically expressed in breast tissue was further evaluated using RNA in situ hybridization on a panel of breast tumors. It was shown to correlate with low tumor grade and estrogen receptor expression, thereby representing a potentially important new breast cancer biomarker. Conclusions: This study provides the first large survey of long non-coding RNA expression within a panel of solid cancers and also identifies a number of novel transcribed regions differentially expressed across distinct cancer types that represent candidate biomarkers for future research.


BMC Biology | 2010

Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls

Daniela M. Witten; Robert Tibshirani; Sam Guoping Gu; Andrew Fire; Weng-Onn Lui

Background: Ultra-high throughput sequencing technologies provide opportunities both for discovery of novel molecular species and for detailed comparisons of gene expression patterns. Small RNA populations are particularly well suited to this analysis, as many different small RNAs can be completely sequenced in a single instrument run. Results: We prepared small RNA libraries from 29 tumour/normal pairs of human cervical tissue samples. Analysis of the resulting sequences (42 million in total) defined 64 new human microRNA (miRNA) genes. Both arms of the hairpin precursor were observed in twenty-three of the newly identified miRNA candidates. We tested several computational approaches for the analysis of class differences between high throughput sequencing datasets and describe a novel application of a log linear model that has provided the most effective analysis for this data. This method resulted in the identification of 67 miRNAs that were differentially-expressed between the tumour and normal samples at a false discovery rate less than 0.001. Conclusions: This approach can potentially be applied to any kind of RNA sequencing data for analysing differential sequence representation between biological sample sets.

Collaboration


Top co-authors of Daniela M. Witten.

Top Co-Authors

Ali Shojaie
University of Washington

Kean Ming Tan
University of Washington

Noah Simon
University of Washington

Su-In Lee
University of Washington

Andrew H. Beck
Beth Israel Deaconess Medical Center

Jay Shendure
University of Washington