Publication


Featured research published by Ana Stanescu.


bioinformatics and biomedicine | 2014

Ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

Ana Stanescu; Doina Caragea

Producing accurate classifiers depends on the quality and quantity of labeled data. The lack of labeled data, which is expensive to generate, critically affects the application of machine learning algorithms to biological problems. Unlabeled data, however, can be acquired faster and in larger quantities thanks to current biochemical technologies known as Next Generation Sequencing. When unlabeled instances greatly outnumber labeled ones, semi-supervised learning represents a cost-effective alternative that can improve supervised classifiers by utilizing unlabeled data. In practice, data often exhibits imbalanced class distributions, which are an obstacle for both supervised and semi-supervised learning. The problem of supervised learning from imbalanced datasets has been extensively studied, and various solutions have been proposed to produce classifiers with optimal performance on highly skewed class distributions. Far fewer efforts have addressed the imbalanced data problem in semi-supervised learning. In this paper, we study several ensemble-based semi-supervised learning approaches for predicting splice sites, a problem for which the imbalance ratio is very high. We run experiments on five imbalanced datasets with the goal of identifying which variants are the most effective.


BMC Systems Biology | 2015

An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

Ana Stanescu; Doina Caragea

Background: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers.

Results: Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines.

Conclusions: In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
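
To make the idea of an ensemble over imbalanced data concrete, here is a minimal sketch of one common way to build balanced training subsets: every ensemble member sees all positives plus its own random sample of negatives of the same size. The abstract does not specify the authors' exact procedure, so the function names (`balanced_subsets`, `average_vote`) and the under-sampling scheme are assumptions made for illustration only.

```python
import random


def balanced_subsets(positives, negatives, n_members, seed=0):
    """Return one (positives, negative_sample) pair per ensemble member.

    Assumption: each member gets ALL positives and a disjoint random chunk
    of negatives of the same size, so every subset has a 1-to-1 class ratio.
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(positives)
    subsets = []
    for m in range(n_members):
        chunk = shuffled[m * size:(m + 1) * size]
        if len(chunk) < size:
            break  # not enough negatives left for another balanced subset
        subsets.append((list(positives), chunk))
    return subsets


def average_vote(member_scores):
    """Combine per-member positive-class scores by simple averaging."""
    return sum(member_scores) / len(member_scores)
```

With a 1-to-99 ratio, 5 positives and 495 negatives would allow up to 99 such members; each member can then be trained with any base learner and the members' scores combined with `average_vote`.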


Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) | 2013

A Hybrid Recommender System: User Profiling from Keywords and Ratings

Ana Stanescu; Swapnil Nagar; Doina Caragea

Over the last decade, user-generated content has grown continuously. Recommender systems that exploit user feedback are widely used in e-commerce and have become important to business growth. To make use of such user feedback, we propose a new content/collaborative hybrid approach, which is built on the recently released hetrec2011-movielens-2k dataset and extends a previously proposed neighborhood-based approach called Weighted Tag Recommender (WTR). Our approach has two versions. Both use ratings to enable collaborative filtering, and they capture movie content information using either user tags, available in the hetrec2011-movielens-2k dataset, or movie keywords retrieved from IMDB. Experimental results show that the information from keywords can help build a movie recommender system competitive with other neighborhood-based approaches and even with more sophisticated state-of-the-art approaches.


data mining in bioinformatics | 2016

Predicting alternatively spliced exons using semi-supervised learning

Ana Stanescu; Karthik Tangirala; Doina Caragea

Cost-efficient next generation sequencers can now produce unprecedented volumes of raw DNA data, posing challenges for annotation. Supervised machine learning approaches have been traditionally used to analyse and annotate complex genomic information. However, such approaches require labelled data for training, which in practice is scarce or expensive, while the unlabelled data is abundant. For some problems, semi-supervised learning can help improve supervised classifiers by making use of large amounts of unlabelled data and the latent information within them. We evaluate the applicability of semi-supervised learning algorithms to the problem of DNA sequence annotation, specifically to the prediction of alternatively spliced exons. We employ Expectation Maximisation, Self-training, and Co-training algorithms in an effort to assess the strengths and limitations of these techniques in the context of alternative splicing.


computational intelligence in bioinformatics and computational biology | 2015

Predicting cassette exons using transductive learning approaches

Ana Stanescu; Doina Caragea

Recent advances in biotechnology have resulted in large volumes of genomic and proteomic data leading to the emergence of numerous in silico methods for annotation, such as supervised machine learning approaches. Such algorithms, however, require large amounts of labeled data for training. In practice, labeled data is oftentimes limited because it is difficult to obtain. Therefore, semi-supervised machine learning is preferable, in which classifiers trained on limited amounts of labeled data can be improved by exploiting the large amounts of unlabeled data. In this work, we focus on transductive learning, a special case of semi-supervised learning. A semi-supervised algorithm builds an inductive model that generalizes well to new, unseen (test) instances. In contrast, during the training phase, a transductive algorithm has access to the (test) instances that need to be classified, allowing advantageous utilization of these points in order to reach the best separation function. Compared to learning a classifier for use with future data, cassette exon identification is a suitable application for transductive learning, since the goal is to annotate a sequenced genome for which a limited amount of labeled data is available. We study the applicability of three popular transductive techniques and their compatibility with various kernels to the binary DNA classification problem of cassette exon identification. The results of our experiments suggest that transductive learning is a useful approach for assisting genome annotation.


International Journal of Bioinformatics Research and Applications | 2017

An empirical study of self-training and data balancing techniques for splice site prediction

Ana Stanescu; Doina Caragea

Thanks to Next Generation Sequencing technologies, unlabelled data is now generated easily, while the annotation process remains expensive. Semi-supervised learning represents a cost-effective alternative to supervised learning, as it can improve supervised classifiers by making use of unlabelled data. However, semi-supervised learning has not been studied much for problems with highly skewed class distributions, which are prevalent in bioinformatics. To address this limitation, we carry out a study of a semi-supervised learning algorithm, specifically self-training based on Naive Bayes, with focus on data-level approaches for handling imbalanced class distributions. Our study is conducted on the problem of predicting splice sites and it is based on datasets for which the ratio of positive to negative examples is 1-to-99. Our results show that under certain conditions semi-supervised learning algorithms are a better choice than purely supervised classification algorithms.
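
The self-training loop described in this abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses a simple position-wise categorical Naive Bayes over fixed-length DNA strings (alphabet `ACGT` only), and it balances the data by adding the same number of confident pseudo-labeled instances per class in each iteration. The names `train_nb`, `predict_proba`, `self_train`, `per_class`, and `max_iter` are hypothetical.

```python
import math

ALPHABET = "ACGT"  # assumption: sequences contain only these symbols


def train_nb(seqs, labels):
    """Fit a position-wise categorical Naive Bayes with Laplace smoothing."""
    classes = sorted(set(labels))
    length = len(seqs[0])
    counts = {c: [dict.fromkeys(ALPHABET, 1) for _ in range(length)]
              for c in classes}
    totals = {c: 0 for c in classes}
    for seq, y in zip(seqs, labels):
        totals[y] += 1
        for i, ch in enumerate(seq):
            counts[y][i][ch] += 1
    model = {}
    for c in classes:
        log_prior = math.log(totals[c] / len(seqs))
        denom = totals[c] + len(ALPHABET)
        log_like = [{ch: math.log(n / denom) for ch, n in pos.items()}
                    for pos in counts[c]]
        model[c] = (log_prior, log_like)
    return model


def predict_proba(model, seq):
    """Return normalized class posteriors for one sequence."""
    scores = {c: lp + sum(ll[i][ch] for i, ch in enumerate(seq))
              for c, (lp, ll) in model.items()}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def self_train(labeled, labels, unlabeled, per_class=1, max_iter=10):
    """Self-training: repeatedly pseudo-label the most confident instances.

    Assumption: adding `per_class` instances for every class per iteration
    is one possible data-level balancing rule, chosen here for illustration.
    """
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_iter):
        if not pool:
            break
        model = train_nb(labeled, labels)
        scored = [(predict_proba(model, s), s) for s in pool]
        chosen = []
        for c in model:
            ranked = sorted(scored, key=lambda t: t[0][c], reverse=True)
            picks = [s for p, s in ranked if max(p, key=p.get) == c]
            chosen.extend((s, c) for s in picks[:per_class])
        if not chosen:
            break
        for s, c in chosen:
            labeled.append(s)
            labels.append(c)
            pool.remove(s)
    return train_nb(labeled, labels)
```

Calling `self_train(labeled, labels, unlabeled)` returns a model that can be queried with `predict_proba(model, sequence)`; on real splice site data the confidence threshold, the number of instances added per iteration, and the balancing rule would all need tuning.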


advances in social networks analysis and mining | 2016

Study of transductive learning and unsupervised feature construction methods for biological sequence classification

Ana Stanescu; Karthik Tangirala; Doina Caragea

Next Generation Sequencing (NGS) technologies have led to fast and inexpensive production of large amounts of biological sequence data, including nucleotide sequences and derived protein sequences. These fast-increasing volumes of data pose challenges to computational methods for annotation. Machine learning approaches, primarily supervised algorithms, have been widely used to assist with classification tasks in bioinformatics. However, supervised algorithms rely on large amounts of labeled data in order to produce quality predictors. Oftentimes, labeled data is difficult and expensive to acquire in sufficiently large quantities. When only limited amounts of labeled data but considerably larger amounts of unlabeled data are available for a specific annotation problem, semi-supervised learning approaches represent a cost-effective alternative. In this work, we focus on a special case of semi-supervised learning, namely transductive learning, in which the algorithm has access during the training phase to the instances that need to be labeled. Transduction is particularly suitable for biological sequence classification, where the goal is generally to label a given set of unlabeled instances. However, a challenge that needs to be addressed in this context consists of identification of compact sets of informative features. Given the lack of labeled data, standard supervised feature selection methods may result in unreliable features. Therefore, we study recently proposed unsupervised feature construction approaches together with transductive learning. Experimental results on two classification problems, namely cassette exon identification and protein localization, show that the unsupervised features result in better performance than the supervised features.


international conference on bioinformatics | 2016

Semi-supervised learning of alternatively spliced exons using expectation maximization type approaches

Ana Stanescu; Doina Caragea


international conference on knowledge discovery and information retrieval | 2018

Cross-domain Sentiment Classification using an Adapted Naïve Bayes Approach and Features Derived from Syntax Trees

Srilaxmi Cheeti; Ana Stanescu; Doina Caragea

