Yuliya V. Karpievitch
University of Western Australia
Publications
Featured research published by Yuliya V. Karpievitch.
Bioinformatics | 2009
Yuliya V. Karpievitch; Jeffrey R. Stanley; Thomas Taverner; Jianhua Huang; Joshua N. Adkins; Charles Ansong; Fred Heffron; Thomas O. Metz; Wei Jun Qian; Hyunjin Yoon; Richard D. Smith; Alan R. Dabney
MOTIVATION Quantitative mass spectrometry-based proteomics requires protein-level estimates and associated confidence measures. Challenges include the presence of low-quality or incorrectly identified peptides and informative missingness. Furthermore, models are required for rolling peptide-level information up to the protein level. RESULTS We present a statistical model that carefully accounts for informative missingness in peak intensities and allows unbiased, model-based, protein-level estimation and inference. The model is applicable to both label-based and label-free quantitation experiments. We also provide automated, model-based algorithms for filtering of proteins and peptides as well as imputation of missing values. Two LC/MS datasets are used to illustrate the methods. In simulation studies, our methods are shown to achieve substantially more discoveries than standard alternatives. AVAILABILITY The software has been made available in the open-source proteomics platform DAnTE (http://omics.pnl.gov/software/).
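To illustrate the general idea of modelling informative missingness (a minimal sketch under stated assumptions, not the published model): missing log peptide intensities are treated as left-censored below a detection threshold, and a group difference for a single protein is estimated by maximizing a censored normal likelihood. The two-group layout, the threshold value, and all function names are illustrative.

```python
# Minimal sketch: censored-likelihood estimation of a protein-level group effect.
# Assumptions (not from the paper): one protein, two groups, fixed detection threshold.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def censored_negloglik(params, y, group, censored, threshold):
    """Observed intensities contribute a normal density; missing ones contribute
    the probability of lying below the detection threshold (left-censoring)."""
    mu0, diff, log_sigma = params
    sigma = np.exp(log_sigma)
    mu = mu0 + diff * group                     # group-specific protein mean
    obs = ~censored
    ll = norm.logpdf(y[obs], mu[obs], sigma).sum()
    ll += norm.logcdf(threshold, mu[censored], sigma).sum()
    return -ll

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 10)                   # two experimental groups
y = rng.normal(20 + 1.5 * group, 1.0)           # log peptide intensities
censored = y < 19.0                             # values below the limit go missing
y[censored] = np.nan

fit = minimize(censored_negloglik, x0=[20.0, 0.0, 0.0],
               args=(np.nan_to_num(y), group, censored, 19.0))
print("estimated group difference:", fit.x[1])
```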
BMC Bioinformatics | 2012
Yuliya V. Karpievitch; Alan R. Dabney; Richard D. Smith
Shotgun proteomic data are affected by a variety of known and unknown systematic biases as well as high proportions of missing values. Typically, normalization is performed in an attempt to remove systematic biases from the data before statistical inference, sometimes followed by missing value imputation to obtain a complete matrix of intensities. Here we discuss several approaches to normalization and to handling missing values, some initially developed for microarray data and some developed specifically for mass spectrometry-based data.
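As a minimal sketch of two generic preprocessing steps of the kind discussed in such reviews (not any specific method from this paper): per-sample median centering to remove sample-wide intensity shifts, followed by a simple left-shifted imputation that fills missing values from a distribution placed below each sample's mean. All function names and parameter values are illustrative assumptions.

```python
# Generic, illustrative preprocessing: median centering + low-abundance imputation.
import numpy as np

def median_center(log_intensities):
    """Subtract each sample's (column's) median so that sample medians align."""
    med = np.nanmedian(log_intensities, axis=0)
    return log_intensities - med

def left_shifted_imputation(log_intensities, shift=1.8, width=0.3, seed=0):
    """Replace NaNs with draws from a narrow normal placed below each sample's mean,
    reflecting the assumption that missing peptides tend to be low abundance."""
    rng = np.random.default_rng(seed)
    x = log_intensities.copy()
    for j in range(x.shape[1]):
        col = x[:, j]
        mu = np.nanmean(col) - shift * np.nanstd(col)
        sd = width * np.nanstd(col)
        miss = np.isnan(col)
        col[miss] = rng.normal(mu, sd, miss.sum())
    return x

data = np.log2(np.random.default_rng(1).lognormal(10, 1, (100, 6)))
data[np.random.default_rng(2).random(data.shape) < 0.2] = np.nan   # simulate missingness
complete = left_shifted_imputation(median_center(data))
```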
Bioinformatics | 2012
Thomas Taverner; Yuliya V. Karpievitch; Ashoka D. Polpitiya; Joseph N. Brown; Alan R. Dabney; Gordon A. Anderson; Richard D. Smith
MOTIVATION The size and complex nature of mass spectrometry-based proteomics datasets motivate development of specialized software for statistical data analysis and exploration. We present DanteR, a graphical R package that features extensive statistical and diagnostic functions for quantitative proteomics data analysis, including normalization, imputation, hypothesis testing, interactive visualization and peptide-to-protein rollup. More importantly, users can easily extend the existing functionality by including their own algorithms under the Add-On tab. AVAILABILITY DanteR and its associated user guide are available for download free of charge at http://omics.pnl.gov/software/. An updated binary of the DanteR package, together with a vignette, is available on our website. For Windows, a single click automatically installs DanteR along with the R programming environment. For Linux and Mac OS X, users must install R and then follow instructions on the DanteR website for package installation. CONTACT [email protected].
Journal of Proteome Research | 2012
Victor P. Andreev; Vladislav A. Petyuk; Heather M. Brewer; Yuliya V. Karpievitch; Fang Xie; Jennifer Clarke; David G. Camp; Richard D. Smith; Andrew P. Lieberman; Roger L. Albin; Zafar Nawaz; Jimmy El Hokayem; Amanda J. Myers
Quantitative proteomics analysis of cortical samples from 10 Alzheimer's disease (AD) brains versus 10 normally aged brains was performed following the accurate mass and time tag (AMT) approach with a high-resolution LTQ Orbitrap mass spectrometer. More than 1400 proteins were identified and quantitated. A conservative approach of selecting only the consensus results of four normalization methods was suggested and used. A total of 197 proteins were shown to be significantly differentially abundant (p-values <0.05, corrected for multiplicity of testing) in AD versus control brain samples. Thirty-seven of these proteins were reported as differentially abundant or modified in AD in previous proteomics and transcriptomics publications. The rest, to the best of our knowledge, are new. Mapping of the discovered proteins with bioinformatic tools revealed significant enrichment of differentially abundant proteins in pathways and processes known to be important in AD, including signal transduction, regulation of protein phosphorylation, immune response, cytoskeleton organization, lipid metabolism, energy production, and cell death.
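The "consensus of four normalization methods" can be illustrated with a minimal sketch (not the study's actual pipeline): a protein is retained only if it is called significant under every normalization considered. The p-value inputs and the number of methods here are placeholders.

```python
# Illustrative consensus filter across several normalization methods.
import numpy as np

def consensus_significant(pvalue_matrix, alpha=0.05):
    """pvalue_matrix: proteins x normalization-methods array of adjusted p-values.
    Returns a boolean mask of proteins significant under all methods."""
    return np.all(pvalue_matrix < alpha, axis=1)

rng = np.random.default_rng(3)
adjusted_p = rng.uniform(0, 0.2, size=(1400, 4))   # e.g. four normalizations
mask = consensus_significant(adjusted_p)
print(f"{mask.sum()} proteins significant under all four normalizations")
```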
Bioinformatics | 2009
Yuliya V. Karpievitch; Thomas Taverner; Joshua N. Adkins; Stephen J. Callister; Gordon A. Anderson; Richard D. Smith; Alan R. Dabney
MOTIVATION LC-MS allows for the identification and quantification of proteins from biological samples. As with any high-throughput technology, systematic biases are often observed in LC-MS data, making normalization an important preprocessing step. Normalization models need to be flexible enough to capture biases of arbitrary complexity, while avoiding overfitting that would invalidate downstream statistical inference. Careful normalization of MS peak intensities would enable greater accuracy and precision in quantitative comparisons of protein abundance levels. RESULTS We propose an algorithm, called EigenMS, that uses singular value decomposition to capture and remove biases from LC-MS peak intensity measurements. EigenMS is an adaptation of the surrogate variable analysis (SVA) algorithm of Leek and Storey, with the adaptations including (i) handling of the widespread missing measurements that are typical in LC-MS, and (ii) a novel approach to preventing overfitting that facilitates the incorporation of EigenMS into an existing proteomics analysis pipeline. Using both large-scale calibration measurements and simulations, EigenMS is shown to perform well relative to existing alternatives. AVAILABILITY The software has been made available in the open-source proteomics platform DAnTE (Polpitiya et al., 2008) (http://omics.pnl.gov/software/), as well as in standalone software available at SourceForge (http://sourceforge.net).
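A minimal sketch of the general SVD idea behind EigenMS (not the published implementation, which additionally handles missing data and guards against overfitting): remove known group effects, take the SVD of the residual matrix, and subtract the leading "bias" components from the data. The complete-data assumption and all names are illustrative.

```python
# Illustrative SVD-based bias removal on a complete peptides x samples matrix.
import numpy as np

def svd_bias_removal(log_intensities, group, n_trends=1):
    """group: integer label per sample (column). Returns a bias-corrected matrix."""
    # Residuals after removing per-group peptide means (a simple fixed-effects fit).
    resid = log_intensities.copy()
    for g in np.unique(group):
        cols = group == g
        resid[:, cols] -= resid[:, cols].mean(axis=1, keepdims=True)
    # SVD of the residuals; leading components capture systematic sample-wise trends.
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    bias = (U[:, :n_trends] * s[:n_trends]) @ Vt[:n_trends, :]
    return log_intensities - bias

rng = np.random.default_rng(4)
group = np.repeat([0, 1], 5)
data = rng.normal(20, 1, (500, 10)) + np.linspace(-1, 1, 10)   # linear sample bias
corrected = svd_bias_removal(data, group, n_trends=1)
```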
Bioinformatics | 2007
Yuliya V. Karpievitch; Elizabeth G. Hill; Adam J. Smolka; Jeffrey S. Morris; Kevin R. Coombes; Keith A. Baggerly; Jonas S. Almeida
We introduce a simple-to-use graphical tool that enables researchers to easily prepare time-of-flight mass spectrometry data for analysis. For ease of use, the graphical executable provides default parameter settings, experimentally determined to work well in most situations. These values, if desired, can be changed by the user. PrepMS is a stand-alone application made freely available (open source) under the General Public License (GPL). Its graphical user interface, default parameter settings, and display plots allow PrepMS to be used effectively for data preprocessing, peak detection and visual data quality assessment. AVAILABILITY Stand-alone executable files and a Matlab toolbox are available for download at: http://sourceforge.net/projects/prepms
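The kind of preprocessing PrepMS automates can be sketched generically (these are illustrative steps and defaults, not PrepMS's actual algorithms): rough baseline removal with a rolling minimum followed by peak detection on a TOF-MS spectrum.

```python
# Generic, illustrative spectrum preprocessing: baseline removal + peak detection.
import numpy as np
from scipy.ndimage import minimum_filter1d
from scipy.signal import find_peaks

def detect_peaks(mz, intensity, baseline_window=201, min_snr=5.0):
    baseline = minimum_filter1d(intensity, size=baseline_window)   # rolling-minimum baseline
    signal = intensity - baseline
    noise = np.median(np.abs(signal - np.median(signal))) + 1e-9   # robust noise scale
    peaks, _ = find_peaks(signal, height=min_snr * noise)
    return mz[peaks], signal[peaks]

mz = np.linspace(1000, 10000, 50_000)
spectrum = np.random.default_rng(6).gamma(2.0, 5.0, mz.size)       # synthetic noise floor
spectrum[25_000] += 500                                            # one synthetic peak
peak_mz, peak_heights = detect_peaks(mz, spectrum)
```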
Analytical Chemistry | 2011
Jeffrey R. Stanley; Joshua N. Adkins; Gordon W. Slysz; Matthew E. Monroe; Samuel O. Purvine; Yuliya V. Karpievitch; Gordon A. Anderson; Richard D. Smith; Alan R. Dabney
Current algorithms for quantifying peptide identification confidence in the accurate mass and time (AMT) tag approach assume that the AMT tags themselves have been correctly identified. However, there is uncertainty in the identification of AMT tags, because this is based on matching LC-MS/MS fragmentation spectra to peptide sequences. In this paper, we incorporate confidence measures for the AMT tag identifications into the calculation of probabilities for correct matches to an AMT tag database, resulting in a more accurate overall measure of identification confidence for the AMT tag approach. The method is referred to as Statistical Tools for AMT Tag Confidence (STAC). STAC additionally provides a uniqueness probability (UP) to help distinguish between multiple matches to an AMT tag, as well as a method to calculate an overall false discovery rate (FDR). STAC is freely available for download as both a command-line and a Windows graphical application.
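One way an overall FDR can be derived from per-match confidence scores is sketched below (illustrative only; STAC's actual probability model is more involved): if p_i is the probability that match i is correct, the expected proportion of false matches among the accepted set is the mean of (1 - p_i). The threshold and names are assumptions.

```python
# Illustrative FDR estimate from per-match correctness probabilities.
import numpy as np

def estimated_fdr(match_probabilities, threshold=0.9):
    """Accept matches with probability >= threshold; estimate the FDR of the
    accepted set as the average incorrectness probability."""
    accepted = match_probabilities[match_probabilities >= threshold]
    if accepted.size == 0:
        return 0.0
    return float(np.mean(1.0 - accepted))

probs = np.random.default_rng(5).beta(8, 1, size=10_000)   # simulated confidences
print(f"estimated FDR at 0.9 cutoff: {estimated_fdr(probs):.3f}")
```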
PLOS ONE | 2009
Yuliya V. Karpievitch; Elizabeth G. Hill; Anthony P. Leclerc; Alan R. Dabney; Jonas S. Almeida
Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data, producing overoptimistic estimated error rates, and making subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
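The subject-level bootstrapping idea can be sketched in a few lines (a minimal illustration of the concept behind RF++, not its C++ implementation): each tree is grown on a bootstrap sample of subjects, carrying along all of a subject's replicates, and test predictions are aggregated per subject by majority vote. Binary 0/1 labels and all names are illustrative assumptions.

```python
# Illustrative subject-level bootstrap ensemble of decision trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def subject_bootstrap_forest(X, y, subjects, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    unique_subjects = np.unique(subjects)
    trees = []
    for _ in range(n_trees):
        # Resample subjects (not individual replicates) with replacement.
        sampled = rng.choice(unique_subjects, size=unique_subjects.size, replace=True)
        idx = np.concatenate([np.where(subjects == s)[0] for s in sampled])
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_subject_level(trees, X, subjects):
    """Average tree votes over all replicates of each subject (binary 0/1 labels assumed)."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)   # per-replicate vote fraction
    return {s: int(np.mean(votes[subjects == s]) > 0.5) for s in np.unique(subjects)}

rng = np.random.default_rng(9)
subjects = np.repeat(np.arange(20), 3)                        # 20 subjects, 3 replicates each
X = rng.normal(size=(60, 8))
y = np.repeat(rng.integers(0, 2, 20), 3)                      # one label per subject
forest = subject_bootstrap_forest(X, y, subjects, n_trees=25)
print(predict_subject_level(forest, X, subjects))
```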
eLife | 2016
Tim Stuart; Steven R Eichten; Jonathan Cahn; Yuliya V. Karpievitch; Justin O. Borevitz; Ryan Lister
Variation in the presence or absence of transposable elements (TEs) is a major source of genetic variation between individuals. Here, we identified 23,095 TE presence/absence variants between 216 Arabidopsis accessions. Most TE variants were rare, and these rare variants were associated with local extremes of gene expression and DNA methylation levels within the population. Of the common alleles identified, two thirds were not in linkage disequilibrium with nearby SNPs, implicating these variants as a source of novel genetic diversity. Many common TE variants were associated with significantly altered expression of nearby genes, and a major fraction of inter-accession DNA methylation differences were associated with nearby TE insertions. Overall, this demonstrates that TE variants are a rich source of genetic diversity that likely plays an important role in facilitating epigenomic and transcriptional differences between individuals, and indicates a strong genetic basis for epigenetic variation. DOI: http://dx.doi.org/10.7554/eLife.20777.001
PLOS ONE | 2014
Yuliya V. Karpievitch; Sonja B. Nikolic; Richard Wilson; James E. Sharman; Lindsay M. Edwards
Liquid chromatography mass spectrometry (LC-MS) has become one of the analytical platforms of choice for metabolomics studies. However, LC-MS metabolomics data can suffer from the effects of various systematic biases. These include batch effects, day-to-day variations in instrument performance, signal intensity loss due to time-dependent changes in LC column performance, accumulation of contaminants in the MS ion source, and MS sensitivity, among others. In this study we aimed to test a singular value decomposition-based method, called EigenMS, for normalization of metabolomics data. We analyzed a clinical human dataset in which LC-MS serum metabolomics data and physiological measurements were collected from thirty-nine healthy subjects and forty subjects with type 2 diabetes, and applied EigenMS to detect and correct for any systematic bias. EigenMS works in several stages. First, EigenMS preserves the treatment group differences in the metabolomics data by estimating treatment effects with an ANOVA model (multiple fixed effects can be estimated). Singular value decomposition of the residual matrix is then used to determine bias trends in the data. The number of bias trends is then estimated via a permutation test, and the effects of the bias trends are eliminated. EigenMS removed bias of unknown complexity from the LC-MS metabolomics data, allowing for increased sensitivity in differential analysis. Moreover, normalized samples correlated better both with other normalized samples and with corresponding physiological data, such as blood glucose level, glycated haemoglobin, exercise central augmentation pressure normalized to a heart rate of 75, and total cholesterol. We were able to report 2578 discriminatory metabolite peaks in the normalized data (p<0.05), compared with only 1840 metabolite signals in the raw data. Our results support the use of singular value decomposition-based normalization for metabolomics data.
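The permutation step described above can be sketched as follows (a minimal illustration, not the EigenMS code): each singular value of the residual matrix is compared against a null distribution obtained by permuting residuals within each metabolite row, and the number of leading significant components is taken as the number of bias trends. Parameters and names are illustrative.

```python
# Illustrative permutation test for the number of significant bias trends.
import numpy as np

def n_significant_trends(residuals, n_perm=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    observed = np.linalg.svd(residuals, compute_uv=False)
    null = np.empty((n_perm, observed.size))
    for b in range(n_perm):
        permuted = np.apply_along_axis(rng.permutation, 1, residuals)  # shuffle within rows
        null[b] = np.linalg.svd(permuted, compute_uv=False)
    pvals = (null >= observed).mean(axis=0)        # fraction of null values exceeding observed
    # Count leading components that are significant before the first non-significant one.
    k = 0
    for p in pvals:
        if p < alpha:
            k += 1
        else:
            break
    return k

resid = np.random.default_rng(7).normal(size=(300, 20))
resid += np.outer(np.random.default_rng(8).normal(size=300), np.linspace(-1, 1, 20))
print("estimated number of bias trends:", n_significant_trends(resid))
```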