Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Gelio Alves is active.

Publication


Featured researches published by Gelio Alves.


Journal of Proteome Research | 2008

Enhancing Peptide Identification Confidence by Combining Search Methods

Gelio Alves; Wells W. Wu; Guanghui Wang; Rong-Fong Shen; Yi-Kuo Yu

Confident peptide identification is one of the most important components in mass-spectrometry-based proteomics. We propose a method to properly combine the results from different database search methods to enhance the accuracy of peptide identifications. The database search methods included in our analysis are SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X! Tandem (v2007.07.01.2), OMSSA (v2.0) and RAId_DbS. Using two data sets, one collected in profile mode and one collected in centroid mode, we tested the search performance of all 21 combinations of two search methods as well as all 35 possible combinations of three search methods. The results obtained from our study suggest that properly combining search methods does improve retrieval accuracy. In addition to performance results, we also describe the theoretical framework which in principle allows one to combine many independent scoring methods including de novo sequencing and spectral library searches. The correlations among different methods are also investigated in terms of common true positives, common false positives, and a global analysis. We find that the average correlation strength, between any pairwise combination of the seven methods studied, is usually smaller than the associated standard error. This indicates only weak correlation may be present among different methods and validates our approach in combining the search results. The usefulness of our approach is further confirmed by showing that the average cumulative number of false positive peptides agrees reasonably well with the combined E-value. The data related to this study are freely available upon request.


Bioinformatics | 2005

Robust accurate identification of peptides (RAId): deciphering MS2 data using a structured library search with de novo based statistics

Gelio Alves; Yi-Kuo Yu

MOTIVATION The key to MS -based proteomics is peptide sequencing. The major challenge in peptide sequencing, whether library search or de novo, is to better infer statistical significance and better attain noise reduction. Since the noise in a spectrum depends on experimental conditions, the instrument used and many other factors, it cannot be predicted even if the peptide sequence is known. The characteristics of the noise can only be uncovered once a spectrum is given. We wish to overcome such issues. RESULTS We designed RAId to identify peptides from their associated tandem mass spectrometry data. RAId performs a novel de novo sequencing followed by a search in a peptide library that we created. Through de novo sequencing, we establish the spectrum-specific background score statistics for the library search. When the database search fails to return significant hits, the top-ranking de novo sequences become potential candidates for new peptides that are not yet in the database. The use of spectrum-specific background statistics seems to enable RAId to perform well even when the spectral quality is marginal. Other important features of RAId include its potential in de novo sequencing alone and the ease of incorporating post-translational modifications.


Biology Direct | 2007

RAId_DbS: Peptide Identification using Database Searches with Realistic Statistics

Gelio Alves; Aleksey Y. Ogurtsov; Yi-Kuo Yu

BackgroundThe key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.ResultsUsing a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Students t-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The T-tests may be used to measure the reliability of reported statistics. When combined with reported P-value for a peptide hit using a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.


BMC Genomics | 2008

RAId_DbS: mass-spectrometry based peptide identification web server with knowledge integration

Gelio Alves; Aleksey Y. Ogurtsov; Yi-Kuo Yu

BackgroundExisting scientific literature is a rich source of biological information such as disease markers. Integration of this information with data analysis may help researchers to identify possible controversies and to form useful hypotheses for further validations. In the context of proteomics studies, individualized proteomics era may be approached through consideration of amino acid substitutions/modifications as well as information from disease studies. Integration of such information with peptide searches facilitates speedy, dynamic information retrieval that may significantly benefit clinical laboratory studies.DescriptionWe have integrated from various sources annotated single amino acid polymorphisms, post-translational modifications, and their documented disease associations (if they exist) into one enhanced database per organism. We have also augmented our peptide identification software RAId_DbS to take into account this information while analyzing a tandem mass spectrum. In principle, one may choose to respect or ignore the correlation of amino acid polymorphisms/modifications within each protein. The former leads to targeted searches and avoids scoring of unnecessary polymorphism/modification combinations; the latter explores possible polymorphisms in a controlled fashion. To facilitate new discoveries, RAId_DbS also allows users to conduct searches permitting novel polymorphisms as well as to search a knowledge database created by the users.ConclusionWe have finished constructing enhanced databases for 17 organisms. The web link to RAId_DbS and the enhanced databases is http://www.ncbi.nlm.nih.gov/CBBResearch/qmbp/RAId_DbS/index.html. The relevant databases and binaries of RAId_DbS for Linux, Windows, and Mac OS X are available for download from the same web page.


PLOS ONE | 2010

RAId_aPS: MS/MS Analysis with Multiple Scoring Functions and Spectrum-Specific Statistics

Gelio Alves; Aleksey Y. Ogurtsov; Yi-Kuo Yu

Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an -value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic -value reported by any method into the textbook-defined -value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific -values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign -values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given a MS/MS spectrum and a scoring function, (iii) reassign -values for a list of candidate peptides given a MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.


Journal of Proteome Research | 2013

Improving Peptide Identification Sensitivity in Shotgun Proteomics by Stratification of Search Space

Gelio Alves; Yi-Kuo Yu

Because of its high specificity, trypsin is the enzyme of choice in shotgun proteomics. Nonetheless, several publications do report the identification of semitryptic and nontryptic peptides. Many of these peptides are thought to be signaling peptides or to have formed during sample preparation. It is known that only a small fraction of tandem mass spectra from a trypsin-digested protein mixture can be confidently matched to tryptic peptides. If other possibilities such as post-translational modifications and single-amino acid polymorphisms are ignored, this suggests that many unidentified spectra originate from semitryptic and nontryptic peptides. To include them in database searches, however, may not improve overall peptide identification because of the possible sensitivity reduction from search space expansion. To circumvent this issue for E-value-based search methods, we have designed a scheme that categorizes qualified peptides (i.e., peptides whose differences in molecular weight from the parent ion are within a specified error tolerance) into three tiers: tryptic, semitryptic, and nontryptic. This classification allows peptides that belong to different tiers to have different Bonferroni correction factors. Our results show that this scheme can significantly improve retrieval performance compared to those of search strategies that assign equal Bonferroni correction factors to all qualified peptides.


Journal of Proteomics | 2011

Assigning statistical significance to proteotypic peptides via database searches

Gelio Alves; Aleksey Y. Ogurtsov; Yi-Kuo Yu

Querying MS/MS spectra against a database containing only proteotypic peptides reduces data analysis time due to reduction of database size. Despite the speed advantage, this search strategy is challenged by issues of statistical significance and coverage. The former requires separating systematically significant identifications from less confident identifications, while the latter arises when the underlying peptide is not present, due to single amino acid polymorphisms (SAPs) or post-translational modifications (PTMs), in the proteotypic peptide libraries searched. To address both issues simultaneously, we have extended RAIds knowledge database to include proteotypic information, utilized RAIds statistical strategy to assign statistical significance to proteotypic peptides, and modified RAIds programs to allow for consideration of proteotypic information during database searches. The extended database alleviates the coverage problem since all annotated modifications, even those that occurred within proteotypic peptides, may be considered. Taking into account the likelihoods of observation, the statistical strategy of RAId provides accurate E-value assignments regardless whether a candidate peptide is proteotypic or not. The advantage of including proteotypic information is evidenced by its superior retrieval performance when compared to regular database searches.


PLOS ONE | 2014

Accuracy evaluation of the unified P-value from combining correlated P-values.

Gelio Alves; Yi-Kuo Yu

Meta-analysis methods that combine -values into a single unified -value are frequently employed to improve confidence in hypothesis testing. An assumption made by most meta-analysis methods is that the -values to be combined are independent, which may not always be true. To investigate the accuracy of the unified -value from combining correlated -values, we have evaluated a family of statistical methods that combine: independent, weighted independent, correlated, and weighted correlated -values. Statistical accuracy evaluation by combining simulated correlated -values showed that correlation among -values can have a significant effect on the accuracy of the combined -value obtained. Among the statistical methods evaluated those that weight -values compute more accurate combined -values than those that do not. Also, statistical methods that utilize the correlation information have the best performance, producing significantly more accurate combined -values. In our study we have demonstrated that statistical methods that combine -values based on the assumption of independence can produce inaccurate -values when combining correlated -values, even when the -values are only weakly correlated. Therefore, to prevent from drawing false conclusions during hypothesis testing, our study advises caution be used when interpreting the -value obtained from combining -values of unknown correlation. However, when the correlation information is available, the weighting-capable statistical method, first introduced by Brown and recently modified by Hou, seems to perform the best amongst the methods investigated.


PLOS ONE | 2011

Combining independent, weighted P-values: achieving computational stability by a systematic expansion with controllable accuracy.

Gelio Alves; Yi-Kuo Yu

Given the expanding availability of scientific data and tools to analyze them, combining different assessments of the same piece of information has become increasingly important for social, biological, and even physical sciences. This task demands, to begin with, a method-independent standard, such as the -value, that can be used to assess the reliability of a piece of information. Goods formula and Fishers method combine independent -values with respectively unequal and equal weights. Both approaches may be regarded as limiting instances of a general case of combining -values from groups; -values within each group are weighted equally, while weight varies by group. When some of the weights become nearly degenerate, as cautioned by Good, numeric instability occurs in computation of the combined -values. We deal explicitly with this difficulty by deriving a controlled expansion, in powers of differences in inverse weights, that provides both accurate statistics and stable numerics. We illustrate the utility of this systematic approach with a few examples. In addition, we also provide here an alternative derivation for the probability distribution function of the general case and show how the analytic formula obtained reduces to both Goods and Fishers methods as special cases. A C++ program, which computes the combined -values with equal numerical stability regardless of whether weights are (nearly) degenerate or not, is available for download at our group website http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/CoinedPValues.html.


Bioinformatics | 2015

Mass spectrometry based protein identification with accurate statistical significance assignment

Gelio Alves; Yi-Kuo Yu

MOTIVATION Assigning statistical significance accurately has become increasingly important as metadata of many types, often assembled in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy of metadata at any level may propagate to downstream analyses, undermining the validity of scientific conclusions thus drawn. From the perspective of mass spectrometry-based proteomics, even though accurate statistics for peptide identification can now be achieved, accurate protein level statistics remain challenging. RESULTS We have constructed a protein ID method that combines peptide evidences of a candidate protein based on a rigorous formula derived earlier; in this formula the database P-value of every peptide is weighted, prior to the final combination, according to the number of proteins it maps to. We have also shown that this protein ID method provides accurate protein level E-value, eliminating the need of using empirical post-processing methods for type-I error control. Using a known protein mixture, we find that this protein ID method, when combined with the Sorić formula, yields accurate values for the proportion of false discoveries. In terms of retrieval efficacy, the results from our method are comparable with other methods tested. AVAILABILITY AND IMPLEMENTATION The source code, implemented in C++ on a linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit.

Collaboration


Dive into the Gelio Alves's collaboration.

Top Co-Authors

Avatar

Yi-Kuo Yu

National Institutes of Health

View shared research outputs
Top Co-Authors

Avatar

Aleksey Y. Ogurtsov

National Institutes of Health

View shared research outputs
Top Co-Authors

Avatar

Guanghui Wang

National Institutes of Health

View shared research outputs
Top Co-Authors

Avatar

Rong-Fong Shen

Center for Biologics Evaluation and Research

View shared research outputs
Top Co-Authors

Avatar

Timothy P. Doerr

National Institutes of Health

View shared research outputs
Top Co-Authors

Avatar

Wells W. Wu

National Institutes of Health

View shared research outputs
Top Co-Authors

Avatar

David B. Sacks

National Institutes of Health

View shared research outputs
Top Co-Authors

Avatar

Marjan Gucek

National Institutes of Health

View shared research outputs
Top Co-Authors

Avatar

Steven K. Drake

National Institutes of Health

View shared research outputs
Top Co-Authors

Avatar

Alex Rubio

National Institutes of Health

View shared research outputs
Researchain Logo
Decentralizing Knowledge