Publication


Featured research published by Yuanyuan Xiao.


Bioinformatics | 2005

Identifying differentially expressed genes from microarray experiments via statistic synthesis

Yee Hwa Yang; Yuanyuan Xiao; Mark R. Segal

Motivation: A common objective of microarray experiments is the detection of differential gene expression between samples obtained under different conditions. The task of identifying differentially expressed genes consists of two aspects: ranking and selection. Numerous statistics have been proposed to rank genes in order of evidence for differential expression. However, no one statistic is universally optimal and there is seldom any basis or guidance that can direct toward a particular statistic of choice.

Results: Our new approach, which addresses both ranking and selection of differentially expressed genes, integrates differing statistics via a distance synthesis scheme. Using a set of (Affymetrix) spike-in datasets, in which differentially expressed genes are known, we demonstrate that our method compares favorably with the best individual statistics, while achieving robustness properties lacked by the individual statistics. We further evaluate performance on one other microarray study.


international conference on bioinformatics | 2007

A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays

Yuanyuan Xiao; Mark R. Segal; Yee Hwa Yang; Ru-Fang Yeh

Motivation: Modern strategies for mapping disease loci require efficient genotyping of a large number of known polymorphic sites in the genome. The sensitive and high-throughput nature of hybridization-based DNA microarray technology provides an ideal platform for such an application by interrogating up to hundreds of thousands of single nucleotide polymorphisms (SNPs) in a single assay. Similar to the development of expression arrays, these genotyping arrays pose many data analytic challenges that are often platform specific. Affymetrix SNP arrays, e.g., use multiple sets of short oligonucleotide probes for each known SNP, and require effective statistical methods to combine these probe intensities in order to generate reliable and accurate genotype calls.

Results: We developed an integrated multi-SNP, multi-array genotype calling algorithm for Affymetrix SNP arrays, MAMS, that combines single-array multi-SNP (SAMS) and multi-array, single-SNP (MASS) calls to improve the accuracy of genotype calls, without the need for training data or computation-intensive normalization procedures as in other multi-array methods. The algorithm uses resampling techniques and model-based clustering to derive single array based genotype calls, which are subsequently refined by competitive genotype calls based on (MASS) clustering. The resampling scheme caps computation for single-array analysis and hence is readily scalable, important in view of expanding numbers of SNPs per array. The MASS update is designed to improve calls for atypical SNPs, harboring allele-imbalanced binding affinities, that are difficult to genotype without information from other arrays. Using a publicly available data set of HapMap samples from Affymetrix, and independent calls by alternative genotyping methods from the HapMap project, we show that our approach performs competitively to existing methods.

Availability: R functions are available upon request from the authors.


Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery | 2011

Multivariate random forests

Mark R. Segal; Yuanyuan Xiao

Random forests have emerged as a versatile and highly accurate classification and regression methodology, requiring little tuning and providing interpretable outputs. Here, we briefly outline the genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree‐structured techniques. We elaborate on aspects of prediction error and attendant tuning parameter issues. However, our emphasis is on extending the random forest schema to the multiple response setting. We provide a simple illustrative example from ecology that showcases the improved fit and enhanced interpretation afforded by the random forest framework.


BMC Genomics | 2002

Assessment of differential gene expression in human peripheral nerve injury

Yuanyuan Xiao; Mark R. Segal; Douglas Kenneth Rabert; Andrew H. Ahn; Praveen Anand; Lakshmi Sangameswaran; Donglei Hu; C. Anthony Hunt

Background: Microarray technology is a powerful methodology for identifying differentially expressed genes. However, when thousands of genes in a microarray data set are evaluated simultaneously by fold changes and significance tests, the probability of detecting false positives rises sharply. In this first microarray study of brachial plexus injury, we applied and compared the performance of two recently proposed algorithms for tackling this multiple testing problem, Significance Analysis of Microarrays (SAM) and Westfall and Young step down adjusted p values, as well as t-statistics and Welch statistics, in specifying differential gene expression under different biological states.

Results: Using SAM based on t statistics, we identified 73 significant genes, which fall into different functional categories, such as cytokines / neurotrophin, myelin function and signal transduction. Interestingly, all but one gene were down-regulated in the patients. Using Welch statistics in conjunction with SAM, we identified an additional set of up-regulated genes, several of which are engaged in transcription and translation regulation. In contrast, the Westfall and Young algorithm identified only one gene using a conventional significance level of 0.05.

Conclusion: In coping with multiple testing problems, Family-wise type I error rate (FWER) and false discovery rate (FDR) are different expressions of Type I error rates. The Westfall and Young algorithm controls FWER. In the context of this microarray study, it is, seemingly, too conservative. In contrast, SAM, by controlling FDR, provides a promising alternative. In this instance, genes selected by SAM were shown to be biologically meaningful.
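The contrast between FWER and FDR control drawn in this abstract can be illustrated with a small sketch. It does not implement SAM or the Westfall and Young procedure; as an assumption for illustration, it substitutes two textbook procedures, Holm (FWER) and Benjamini-Hochberg (FDR), and applies them to a made-up list of p-values.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure: controls the family-wise error rate (FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first failure
    return reject


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: controls the false discovery rate (FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank k with p_(k) <= k * alpha / m.
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for ten genes (illustrative only, not data from the study).
pvals = [0.001, 0.008, 0.012, 0.019, 0.030, 0.200, 0.450, 0.700, 0.900, 0.950]
n_fwer = sum(holm_reject(pvals))
n_fdr = sum(benjamini_hochberg_reject(pvals))
print("FWER (Holm) rejections:", n_fwer)
print("FDR (Benjamini-Hochberg) rejections:", n_fdr)
```

On this toy input the FWER-controlling procedure rejects fewer hypotheses than the FDR-controlling one, which mirrors the qualitative pattern the abstract reports.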


PLOS Computational Biology | 2005

Analysis of a Splice Array Experiment Elucidates Roles of Chromatin Elongation Factor Spt4-5 in Splicing

Yuanyuan Xiao; Yee Hwa Yang; Todd Burckin; Lily Shiue; Grant A. Hartzog; Mark R. Segal

Splicing is an important process for regulation of gene expression in eukaryotes, and it has important functional links to other steps of gene expression. Two examples of these linkages include Ceg1, a component of the mRNA capping enzyme, and the chromatin elongation factors Spt4–5, both of which have recently been shown to play a role in the normal splicing of several genes in the yeast Saccharomyces cerevisiae. Using a genomic approach to characterize the roles of Spt4–5 in splicing, we used splicing-sensitive DNA microarrays to identify specific sets of genes that are mis-spliced in ceg1, spt4, and spt5 mutants. In the context of a complex, nested, experimental design featuring 22 dye-swap array hybridizations, comprising both biological and technical replicates, we applied five appropriate statistical models for assessing differential expression between wild-type and the mutants. To refine selection of differential expression genes, we then used a robust model-synthesizing approach, Differential Expression via Distance Synthesis, to integrate all five models. The resultant list of differentially expressed genes was then further analyzed with regard to select attributes: we found that highly transcribed genes with long introns were most sensitive to spt mutations. QPCR confirmation of differential expression was established for the limited number of genes evaluated. In this paper, we showcase splicing array technology, as well as powerful, yet general, statistical methodology for assessing differential expression, in the context of a real, complex experimental design. Our results suggest that the Spt4–Spt5 complex may help coordinate splicing with transcription under conditions that present kinetic challenges to spliceosome assembly or function.


international conference on bioinformatics | 2005

Prediction of Genomewide Conserved Epitope Profiles of HIV-1: Classifier Choice and Peptide Representation

Yuanyuan Xiao; Mark R. Segal

Identification of peptides binding to Major Histocompatibility Complex (MHC) molecules is important for accelerating vaccine development and improving immunotherapy. Accordingly, a wide variety of prediction methods have been applied in this context. In this paper, we introduce (tree-based) ensemble classifiers for such problems and contrast their predictive performance with forefront existing methods for both MHC class I and class II molecules. In addition, we investigate the impact of differing peptide representation schemes on performance. Finally, classifier predictions are used to conduct genomewide scans of a diverse collection of HIV-1 strains, enabling assessment of epitope conservation. We investigated all combinations of six classification methods (classification trees, artificial neural networks, support vector machines, as well as the more recently devised ensemble methods: bagging, random forests, and boosting) with four peptide representation schemes (amino acid sequence, select biophysical properties, select quantitative structure-activity relationship (QSAR) descriptors, and the combination of the latter two) in predicting peptide binding to an MHC class I molecule (HLA-A2) and MHC class II molecule (HLA-DR4). Our results show that the ensemble methods are consistently more accurate than the other three alternatives. Furthermore, they are robust with respect to parameter tuning. Among the four representation schemes, the amino acid sequence representation gave consistently (across classifiers) best results. This finding obviates the need for feature selection strategies incurred by use of biophysical and/or QSAR properties. We obtained, and aligned, a diverse set of 32 HIV-1 genomes and pursued genomewide HLA-DR4 epitope profiling by querying with respect to classifier predictions, as obtained under each of the four peptide representation schemes. We validated those epitopes conserved across strains against known T-cell epitopes. Once again, amino acid sequence representation was at least as effective as using properties. Assessment of novel epitope predictions awaits experimental verification.


Journal of Clinical Neuroscience | 2004

Plasticity of gene expression in injured human dorsal root ganglia revealed by GeneChip oligonucleotide microarrays.

Douglas Kenneth Rabert; Yuanyuan Xiao; Yiangos Yiangou; Dirk Kreder; Lakshmi Sangameswaran; Mark R. Segal; C. Anthony Hunt; Rolfe Birch; Praveen Anand

Root avulsion from the spinal cord occurs in brachial plexus lesions. It is the practice to repair such injuries by transferring an intact neighbouring nerve to the distal stump of the damaged nerve; avulsed dorsal root ganglia (DRG) are removed to enable nerve transfer. Such avulsed adult human cervical DRG ( [Formula: see text] ) obtained at surgery were compared to controls, for the first time, using GeneChip oligonucleotide arrays. We report 91 genes whose expression levels are clearly altered by the injury. This first study provides a global assessment of the molecular events or gene switches as a consequence of DRG injuries, as the tissues represent a wide range of surgical delay, from 1 to 100 days. A number of these genes are novel with respect to sensory ganglia, while others are known to be involved in neurotransmission, trophism, cytokine functions, signal transduction, myelination, transcription regulation, and apoptosis. Cluster analysis showed that genes involved in the same functional groups are largely positioned close to each other. This study represents an important step in identifying new genes and molecular mechanisms in human DRG, with potential therapeutic relevance for nerve repair and relief of chronic neuropathic pain.


PLOS Computational Biology | 2009

Identification of Yeast Transcriptional Regulation Networks Using Multivariate Random Forests

Yuanyuan Xiao; Mark R. Segal

The recent availability of whole-genome scale data sets that investigate complementary and diverse aspects of transcriptional regulation has spawned an increased need for new and effective computational approaches to analyze and integrate these large scale assays. Here, we propose a novel algorithm, based on random forest methodology, to relate gene expression (as derived from expression microarrays) to sequence features residing in gene promoters (as derived from DNA motif data) and transcription factor binding to gene promoters (as derived from tiling microarrays). We extend the random forest approach to model a multivariate response as represented, for example, by time-course gene expression measures. An analysis of the multivariate random forest output reveals complex regulatory networks, which consist of cohesive, condition-dependent regulatory cliques. Each regulatory clique features homogeneous gene expression profiles and common motifs or synergistic motif groups. We apply our method to several yeast physiological processes: cell cycle, sporulation, and various stress conditions. Our technique displays excellent performance with regard to identifying known regulatory motifs, including high order interactions. In addition, we present evidence of the existence of an alternative MCB-binding pathway, which we confirm using data from two independent cell cycle studies and two other physiological processes. Finally, we have uncovered elaborate transcription regulation refinement mechanisms involving PAC and mRRPE motifs that govern essential rRNA processing. These include intriguing instances of differing motif dosages and differing combinatorial motif control that promote regulatory specificity in rRNA metabolism under differing physiological processes.
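To make the idea of a tree split on a multivariate response concrete, here is a minimal sketch of a single split search. The abstract does not state the split criterion, so the one used here (within-child squared deviation summed over all response components) is an assumption chosen for simplicity, and the names `motif_count` and `expression` and their values are hypothetical toy data.

```python
def sum_of_squares(rows):
    """Squared deviation from the component-wise mean, summed over all responses."""
    if not rows:
        return 0.0
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(k))


def best_split(x, Y):
    """Return (threshold, cost) for the split `x <= threshold` with the lowest child impurity."""
    best = (None, sum_of_squares(Y))  # baseline: no split
    for t in sorted(set(x))[:-1]:     # the largest value would leave the right child empty
        left = [y for xi, y in zip(x, Y) if xi <= t]
        right = [y for xi, y in zip(x, Y) if xi > t]
        cost = sum_of_squares(left) + sum_of_squares(right)
        if cost < best[1]:
            best = (t, cost)
    return best


# Toy data (assumed): one promoter feature and a two-point expression profile per gene.
motif_count = [0, 0, 1, 2, 2, 3]
expression = [[0.1, 0.0], [0.0, 0.2], [0.1, 0.1],
              [1.0, 2.0], [1.1, 2.1], [0.9, 1.9]]
threshold, cost = best_split(motif_count, expression)
print("split on motif_count <=", threshold)
```

A full forest would repeat this search over bootstrap samples and random feature subsets and average the resulting trees; only the single-split step is shown here.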


Bioinformatics | 2008

Biological sequence classification utilizing positive and unlabeled data

Yuanyuan Xiao; Mark R. Segal

Motivation: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.

Results: Here, we develop a novel method, likely positive-iterative classification (LP-IC) for this problem, and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.


Biostatistics | 2011

Clustering with exclusion zones: genomic applications

Mark R. Segal; Yuanyuan Xiao; Fred W. Huffer

Methods for formally evaluating the clustering of events in space or time, notably the scan statistic, have been richly developed and widely applied. In order to utilize the scan statistic and related approaches, it is necessary to know the extent of the spatial or temporal domains wherein the events arise. Implicit in their usage is that these domains have no holes (hereafter exclusion zones), regions in which events a priori cannot occur. However, in many contexts, this requirement is not met. When the exclusion zones are known, it is straightforward to correct the scan statistic for their occurrence by simply adjusting the extent of the domain. Here, we tackle the more ambitious objective of formally evaluating clustering in the presence of unknown exclusion zones. We develop an algorithm for estimating total exclusion zone extent, the quantity needed to correct scan statistic-based inference, using distributional properties of spacings, and show how bias correction for this estimator can be effected. Performance of the algorithm is assessed via simulation study. We showcase applications to genomic settings for differing marker (event) types (binding sites, housekeeping genes, and microRNAs) wherein exclusion zones can arise through a variety of mechanisms. In several instances, dramatic changes to unadjusted inference that does not accommodate exclusions are evidenced.
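The simple case mentioned in this abstract, known exclusion zones handled by adjusting the extent of the domain, can be sketched as follows. This does not reproduce the paper's estimator for unknown zones. The event positions, zone boundaries, and the use of a uniform-null expected count per window are all assumptions made for illustration.

```python
def scan_statistic(positions, window):
    """Maximum number of events in any interval of length `window` (two-pointer sweep)."""
    pts = sorted(positions)
    best, left = 0, 0
    for right in range(len(pts)):
        while pts[right] - pts[left] > window:
            left += 1
        best = max(best, right - left + 1)
    return best


def effective_extent(domain_length, zones):
    """Domain extent after subtracting known, non-overlapping exclusion zones."""
    return domain_length - sum(end - start for start, end in zones)


# Hypothetical events on a domain of length 100 with two known exclusion zones.
events = [5, 12, 14, 15, 60, 62, 90]
zones = [(20, 55), (65, 85)]
window = 10

observed = scan_statistic(events, window)
extent = effective_extent(100, zones)

# Expected events per window if events were uniform over the available domain.
expected_naive = len(events) * window / 100
expected_adjusted = len(events) * window / extent
print(observed, extent, expected_naive, expected_adjusted)
```

Because the available domain is shorter than the nominal one, the expected count per window rises after adjustment, so the same observed cluster looks less extreme than it would under the unadjusted calculation.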

Collaboration


Dive into Yuanyuan Xiao's collaborations.

Top Co-Authors

Mark R. Segal

University of California

C.A. Hunt

University of California

Donglei Hu

University of California
